Abstract

We consider a class of sparse learning problems in high dimensional feature space regularized 
by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure 
of the features. Such problems often pose a considerable challenge to optimization algorithms 
due to the non-smoothness and non-separability of the regularization term. In this paper, we 
focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group
Lasso penalty $l_1/l_2$-norm and the $l_1/l_\infty$-norm. We propose a unified framework based on the
augmented Lagrangian method, under which problems with both types of regularization and
their variants can be efficiently solved. As one of the core building-blocks of this framework,
we develop new algorithms using a partial-linearization/splitting technique and prove that the
accelerated versions of these algorithms require $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal
solution. We compare the performance of these algorithms against that of the alternating direction
augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two 
real-world problems to compare the relative merits of the two norms. 

Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable 
splitting, augmented Lagrangian 

1 Introduction 

For feature learning problems in a high-dimensional space, sparsity in the feature vector is usually 
a desirable property. Many statistical models have been proposed in the literature to enforce 
sparsity, dating back to the classical Lasso model ($l_1$-regularization) [351 EH]. The Lasso model is
particularly appealing because it can be solved by very efficient proximal gradient methods; e.g.,
see [12]. However, the Lasso does not take into account the structure of the features [50]. In
many real applications, the features in a learning problem are often highly correlated, exhibiting
a group structure. Structured sparsity has been shown to be effective in those cases. The Group
Lasso model [l9l [21 [IT] assumes disjoint groups and enforces sparsity on the pre-defined groups
of features. This model has been extended to allow for groups that are hierarchical as well as
overlapping [25, 27, 13] with a wide array of applications from gene selection [27] to computer
vision [23, 26]. For image denoising problems, extensions with non-integer block sizes and adaptive
partitions have been proposed in [36, 37]. In this paper, we consider the following basic model of
minimizing the squared-error loss with a regularization term to induce group sparsity:



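$$ \min_{x \in \mathbb{R}^m} \; L(x) + \Omega(x), \tag{1} $$

where

$$ L(x) = \frac{1}{2}\|Ax - b\|^2, \quad A \in \mathbb{R}^{n \times m}, $$

$$ \Omega(x) = \begin{cases} \Omega_{l_1/l_2}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|, & \text{or} \\ \Omega_{l_1/l_\infty}(x) \equiv \lambda \sum_{s \in S} w_s \|x_s\|_\infty. \end{cases} \tag{2} $$

$S = \{s_1, \cdots, s_{|S|}\}$ is the set of group indices with $|S| = J$, and the elements (features) in the
groups possibly overlap [TTl, 30, 25, 3]. In this model, $\lambda$, $w_s$, $S$ are all pre-defined. $\|\cdot\|$ without a
subscript denotes the $l_2$-norm. We note that the penalty term $\Omega_{l_1/l_2}(x)$ in (2) is different from the
one proposed in [23], although both are called overlapping Group Lasso penalties. In particular,
(1)-(2) cannot be cast into a non-overlapping group lasso problem as done in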



1.1 Related Work 

Two proximal gradient methods have been proposed to solve a close variant of (1) with an $l_1/l_2$
penalty,

$$ \min_x \; L(x) + \Omega_{l_1/l_2}(x) + \lambda\|x\|_1, \tag{3} $$

which has an additional $l_1$-regularization term on $x$. Chen et al. replace $\Omega_{l_1/l_2}(x)$ with a
smooth approximation $\Omega_\eta(x)$ by using Nesterov's smoothing technique [33] and solve the resulting
problem by the Fast Iterative Shrinkage Thresholding algorithm (FISTA). The parameter $\eta$ is a
smoothing parameter, upon which the practical and theoretical convergence speed of the algorithm
critically depends. Liu and Ye [29] also apply FISTA to solve (3), but in each iteration, they
transform the computation of the proximal operator associated with the combined penalty term
into an equivalent constrained smooth problem and solve it by Nesterov's accelerated gradient
descent method [33]. Mairal et al. [30] apply the accelerated proximal gradient method to (1) with
the $l_1/l_\infty$ penalty and propose a network flow algorithm to solve the proximal problem associated with
$\Omega_{l_1/l_\infty}(x)$. Mosci et al.'s method [32] for solving the Group Lasso problem in [2^ is in the same
spirit as [29], but their approach uses a projected Newton method.



1.2 Our Contributions 

We take a unified approach to tackle problem (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. Our
strategy is to develop efficient algorithms based on the Alternating Linearization Method with
Skipping (ALM-S) [19] and FISTA for solving an equivalent constrained version of problem (1) (to
be introduced in Section 2) in an augmented Lagrangian method framework. Specifically, we make
the following contributions in this paper:

• We build a general framework based on the augmented Lagrangian method, under which
learning problems with both $l_1/l_2$ and $l_1/l_\infty$ regularizations (and their variants) can be solved.
This framework allows for experimentation with its key building blocks.

• We propose new algorithms: ALM-S with partial splitting (APLM-S) and FISTA with partial
linearization (FISTA-p), to serve as the key building block for this framework. We prove that
APLM-S and FISTA-p have convergence rates of $O(1/k)$ and $O(1/k^2)$ respectively, where $k$ is
the number of iterations. Our algorithms are easy to implement and tune, and they do not
require line-search, eliminating the need to evaluate the objective function at every iteration.

• We evaluate the quality and speed of the proposed algorithms and framework against state-of-
the-art approaches on a rich set of synthetic test data and compare the $l_1/l_2$ and $l_1/l_\infty$ models
on breast cancer gene expression data [47] and a video sequence background subtraction task

m- 



2 A Variable-Splitting Augmented Lagrangian Framework 

In this section, we present a unified framework, based on variable splitting and the augmented
Lagrangian method, for solving (1) with both $l_1/l_2$- and $l_1/l_\infty$-regularizations. This framework
reformulates problem (1) as an equivalent linearly-constrained problem, by using the following
variable-splitting procedure.

Let $y \in \mathbb{R}^{\sum_{s \in S}|s|}$ be the vector obtained from the vector $x \in \mathbb{R}^m$ by repeating components of $x$
so that no component of $y$ belongs to more than one group. Let $M = \sum_{s \in S}|s|$. The relationship
between $x$ and $y$ is specified by the linear constraint $Cx = y$, where the $(i, j)$-th element of the
matrix $C \in \mathbb{R}^{M \times m}$ is

$$ C_{i,j} = \begin{cases} 1, & \text{if } y_i \text{ is a replicate of } x_j, \\ 0, & \text{otherwise.} \end{cases} \tag{4} $$

For examples of $C$, refer to (llj. Consequently, (1) is equivalent to

$$ \min_{x, y} \; F_{obj}(x, y) \equiv \frac{1}{2}\|Ax - b\|^2 + \tilde\Omega(y) \quad \text{s.t. } Cx = y, \tag{5} $$

where $\tilde\Omega(y)$ is the non-overlapping group-structured penalty term corresponding to $\Omega(y)$ defined in
(2). Note that $C$ is a highly sparse matrix, and $D = C^T C$ is a diagonal matrix with the diagonal
entries equal to the number of times that each entry of $x$ is included in some group. Problem
(5) now includes two sets of variables $x$ and $y$, where $x$ appears only in the loss term $L(x)$ and $y$
appears only in the penalty term $\tilde\Omega(y)$.
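The construction of $C$ can be illustrated with a small sketch. The helper name `build_replication_matrix` and the toy groups below are hypothetical and chosen only for illustration; a dense array is used for readability, although a sparse format would be the natural choice in practice.

```python
import numpy as np

def build_replication_matrix(groups, m):
    """Return the 0/1 matrix C such that C @ x stacks x[g] for each group g."""
    M = sum(len(g) for g in groups)
    C = np.zeros((M, m))
    row = 0
    for g in groups:
        for j in g:
            C[row, j] = 1.0  # y[row] is a replicate of x[j]
            row += 1
    return C

# Three overlapping groups over m = 5 features (hypothetical example).
groups = [[0, 1, 2], [2, 3], [3, 4, 0]]
C = build_replication_matrix(groups, m=5)
x = np.arange(1.0, 6.0)
y = C @ x              # replicated vector, one block per group
D = C.T @ C            # diagonal; entry j counts the groups containing feature j
counts = np.diag(D)
```

Here `counts` records how many groups each feature belongs to, which is the diagonal of $D$ described above.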

All the non-overlapping versions of $\Omega(\cdot)$, including the Lasso and Group Lasso, are special cases
of $\Omega(\cdot)$ with $C = I$, i.e. $x = y$. Hence, (5) in this case is equivalent to applying variable-splitting
on $x$. Problems with a composite penalty term, such as the Elastic Net, $\lambda_1\|x\|_1 + \lambda_2\|x\|^2$, can also
be reformulated in a similar way by merging the smooth part of the penalty term ($\lambda_2\|x\|^2$ in the
case of the Elastic Net) with the loss function $L(x)$.

To solve (5), we apply the augmented Lagrangian method [22, 38, IMl, H] to it. This method,
Algorithm 2.1, minimizes the augmented Lagrangian

$$ \mathcal{L}(x, y, v) = \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \tilde\Omega(y) \tag{6} $$

exactly for a given Lagrange multiplier $v$ in every iteration followed by an update to $v$. The
parameter $\mu$ in (6) controls the amount of weight that is placed on violations of the constraint
$Cx = y$. Algorithm 2.1 can also be viewed as a dual ascent algorithm applied to $P(v) = \min_{x,y} \mathcal{L}(x, y, v)$
[5], where $v$ is the dual variable, $\frac{1}{\mu}$ is the step-length, and $Cx - y$ is the gradient $\nabla_v P(v)$. This
algorithm does not require $\mu$ to be very small to guarantee convergence to the solution of problem
(5) [34]. However, solving the problem in Line 3 of Algorithm 2.1 exactly can be very challenging
in the case of structured sparsity. We instead seek an approximate minimizer of the augmented
Lagrangian via the abstract subroutine ApproxAugLagMin(x, y, v). The following theorem [40]
guarantees the convergence of this inexact version of Algorithm 2.1.

Algorithm 2.1 AugLag

1: Choose $x^0$, $y^0$, $v^0$.
2: for $l = 0, 1, \cdots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow \arg\min_{x,y} \mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for



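Theorem 2.1. Let $\alpha^l := \mathcal{L}(x^l, y^l, v^l) - \inf_{x \in \mathbb{R}^m, y \in \mathbb{R}^M} \mathcal{L}(x, y, v^l)$ and $F^*$ be the optimal value of
problem (5). Suppose problem (5) satisfies the modified Slater's condition, and

$$ \sum_{l=1}^{\infty} \sqrt{\alpha^l} < +\infty. \tag{7} $$

Then, the sequence $\{v^l\}$ converges to $v^*$, which satisfies

$$ \inf_{x, y} \left\{ F_{obj}(x, y) - (v^*)^T(Cx - y) \right\} = F^*, $$

while the sequence $\{x^l, y^l\}$ satisfies $\lim_{l \to \infty} Cx^l - y^l = 0$ and $\lim_{l \to \infty} F_{obj}(x^l, y^l) = F^*$.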

The condition (7) requires the augmented Lagrangian subproblem be solved with increasing
accuracy. We formally state this framework in Algorithm 2.2. We index the iterations of Algorithm
2.2 by $l$ and call them 'outer iterations'. In Section 3, we develop algorithms that implement
ApproxAugLagMin(x, y, v). The iterations of these subroutines are indexed by $k$ and are called
'inner iterations'.

Algorithm 2.2 OGLasso-AugLag

1: Choose $x^0$, $y^0$, $v^0$.
2: for $l = 0, 1, \cdots$ do
3:   $(x^{l+1}, y^{l+1}) \leftarrow$ ApproxAugLagMin($x^l$, $y^l$, $v^l$), to compute an approximate minimizer of $\mathcal{L}(x, y, v^l)$
4:   $v^{l+1} \leftarrow v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$
5:   Update $\mu$ according to the chosen updating scheme.
6: end for

3 Methods for Approximately Minimizing the Augmented Lagrangian

In this section, we use the overlapping Group Lasso penalty $\Omega(x) = \lambda \sum_{s \in S} w_s\|x_s\|$ to illustrate
the optimization algorithms under discussion. The case of $l_1/l_\infty$-regularization will be discussed in
Section 4. From now on, we assume without loss of generality that $w_s = 1$ for every group $s$.
3.1 Alternating Direction Augmented Lagrangian (ADAL) Method

The well-known Alternating Direction Augmented Lagrangian (ADAL) method [15, 17, 18, E]
approximately minimizes the augmented Lagrangian by minimizing (6) with respect to $x$ and $y$
alternatingly and then updates the Lagrange multiplier $v$ on each iteration (e.g., see [7], Section 3.4).
Specifically, the single-iteration procedure that serves as the procedure ApproxAugLagMin(x, y, v)
is given below as Algorithm 3.1.

Algorithm 3.1 ADAL

1: Given $x^l$, $y^l$, and $v^l$.
2: $x^{l+1} \leftarrow \arg\min_x \mathcal{L}(x, y^l, v^l)$
3: $y^{l+1} \leftarrow \arg\min_y \mathcal{L}(x^{l+1}, y, v^l)$
4: return $x^{l+1}$, $y^{l+1}$.



The ADAL method, also known as the alternating direction method of multipliers (ADMM) and
the split Bregman method, has recently been applied to problems in signal and image processing
[12^ [H [20] and low-rank matrix recovery [28]. Its convergence has been established in [15]. This
method can accommodate a sum of more than two functions. For example, by applying variable-
splitting (e.g., see [3 [H]) to the problem $\min_x f(x) + \sum_{i=1}^{K} g_i(C_i x)$, it can be transformed into

$$ \min_{x, y_1, \cdots, y_K} \; f(x) + \sum_{i=1}^{K} g_i(y_i) \quad \text{s.t. } y_i = C_i x, \; i = 1, \cdots, K. $$

The subproblems corresponding to the $y_i$'s can thus be solved simultaneously by the ADAL method.
This so-called simultaneous direction method of multipliers (SDMM) is related to Spingarn's
method of partial inverses [43] and has been shown to be a special instance of a more general
parallel proximal algorithm with inertia parameters [35].
Note that the problem solved in Line 3 of Algorithm 3.1,

$$ y^{l+1} = \arg\min_y \mathcal{L}(x^{l+1}, y, v^l) \equiv \arg\min_y \left\{ \frac{1}{2\mu}\|d^l - y\|^2 + \tilde\Omega(y) \right\}, \tag{8} $$

where $d^l = Cx^{l+1} - \mu v^l$, is group-separable and hence can be solved in parallel. Each
subproblem can be solved by applying the block soft-thresholding operator,
$T(d_s^l, \mu\lambda) \equiv \frac{d_s^l}{\|d_s^l\|}\max(0, \|d_s^l\| - \lambda\mu)$, $s = 1, \cdots, J$. Solving for $x^{l+1}$ in Line 2 of Algorithm
3.1, i.e.

$$ x^{l+1} = \arg\min_x \mathcal{L}(x, y^l, v^l) \equiv \arg\min_x \left\{ \frac{1}{2}\|Ax - b\|^2 - (v^l)^T Cx + \frac{1}{2\mu}\|Cx - y^l\|^2 \right\}, \tag{9} $$

involves solving the linear system

$$ \left(A^T A + \frac{1}{\mu}D\right)x = A^T b + C^T v^l + \frac{1}{\mu}C^T y^l, \tag{10} $$

where the matrix on the left hand side of (10) has dimension $m \times m$. Many real-world data sets, such
as gene expression data, are highly under-determined. Hence, the number of features ($m$) is much
larger than the number of samples ($n$). In such cases, one can use the Sherman-Morrison-Woodbury
formula,

$$ \left(A^T A + \frac{1}{\mu}D\right)^{-1} = \mu D^{-1} - \mu^2 D^{-1}A^T\left(I + \mu A D^{-1} A^T\right)^{-1} A D^{-1}, \tag{11} $$

and solve instead an $n \times n$ linear system involving the matrix $I + \mu A D^{-1} A^T$. In addition, as long
as $\mu$ stays the same, one has to factorize $A^T A + \frac{1}{\mu}D$ or $I + \mu A D^{-1} A^T$ only once and store their
factors for subsequent iterations.

^ Recently, Mairal et al. [31] also applied ADAL with two variants based on variable-splitting to the overlapping
Group Lasso problem.
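As an illustration of the identity (11), the following sketch compares a direct $m \times m$ solve with the $n \times n$ solve obtained from the Sherman-Morrison-Woodbury formula. The function name `solve_smw` and the random test data are hypothetical; the diagonal of $D$ is passed as a vector `d` with strictly positive entries.

```python
import numpy as np

def solve_smw(A, d, mu, rhs):
    """Solve (A^T A + D/mu) x = rhs through formula (11), with D = diag(d)."""
    ADinv = A / d                                   # A D^{-1}: column j scaled by 1/d[j]
    small = np.eye(A.shape[0]) + mu * ADinv @ A.T   # n x n matrix I + mu A D^{-1} A^T
    t = np.linalg.solve(small, ADinv @ rhs)
    return mu * rhs / d - mu**2 * (ADinv.T @ t)

rng = np.random.default_rng(0)
n, m, mu = 5, 40, 0.1                               # under-determined: m >> n
A = rng.standard_normal((n, m))
d = rng.integers(1, 4, size=m).astype(float)        # group counts, all >= 1
rhs = rng.standard_normal(m)

x_direct = np.linalg.solve(A.T @ A + np.diag(d) / mu, rhs)
x_smw = solve_smw(A, d, mu, rhs)
```

In a full implementation one would factorize `small` once and reuse the factor while $\mu$ is unchanged; the sketch re-solves from scratch for brevity.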

When both $n$ and $m$ are very large, it might be infeasible to compute or store $A^T A$, not to
mention its eigen-decomposition, or the Cholesky decomposition of $A^T A + \frac{1}{\mu}D$. In this case, one can
solve the linear systems using the preconditioned Conjugate Gradient (PCG) method [21]. Similar
comments apply to the other algorithms proposed in Sections 3.2 - 3.4 below. Alternatively, we can
apply FISTA to Line 3 in Algorithm 2.2 (see Section 3.5).
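The pieces of this section can be combined into a short sketch of Algorithm 2.2 with one ADAL step per outer iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `adal_group_lasso` is hypothetical, $\mu$ is kept fixed, a fixed iteration count replaces the stopping criteria, dense NumPy arrays are used, and every feature is assumed to belong to at least one group so that $A^T A + \frac{1}{\mu}D$ is positive definite.

```python
import numpy as np

def adal_group_lasso(A, b, groups, lam, mu=0.1, iters=500):
    """Sketch: augmented Lagrangian outer loop with one ADAL step per iteration."""
    m = A.shape[1]
    M = sum(len(g) for g in groups)
    C = np.zeros((M, m))
    blocks, row = [], 0
    for g in groups:
        C[row + np.arange(len(g)), g] = 1.0      # block of y replicating x[g]
        blocks.append(slice(row, row + len(g)))
        row += len(g)

    # Factor the left-hand side of (10) once; it is constant while mu is fixed.
    chol = np.linalg.cholesky(A.T @ A + (C.T @ C) / mu)
    Atb = A.T @ b
    x, y, v = np.zeros(m), np.zeros(M), np.zeros(M)
    for _ in range(iters):
        # x-update: solve (10) using the stored factor (np.linalg.solve is used
        # for brevity; it does not exploit the triangular structure).
        rhs = Atb + C.T @ v + (C.T @ y) / mu
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        # y-update: block soft-thresholding of d = Cx - mu*v, as in (8).
        d = C @ x - mu * v
        for blk in blocks:
            norm = np.linalg.norm(d[blk])
            y[blk] = 0.0 if norm <= mu * lam else (1.0 - mu * lam / norm) * d[blk]
        # Multiplier update (Line 4 of Algorithm 2.1).
        v = v - (C @ x - y) / mu
    return x, y, v

# Sanity check: with A = I and singleton groups the problem is the Lasso with
# an identity design, whose solution is soft-thresholding of b at level lam.
b = np.array([3.0, -0.5, 1.2, -2.0])
x_hat, y_hat, _ = adal_group_lasso(np.eye(4), b, [[0], [1], [2], [3]], lam=1.0)
expected = np.sign(b) * np.maximum(np.abs(b) - 1.0, 0.0)
```

The singleton-group check is only a convenient special case with a closed-form answer; overlapping groups are handled by the same code through the rows of `C`.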



3.2 ALM-S: partial split (APLM-S) 

We now consider applying the Alternating Linearization Method with Skipping (ALM-S) from [19]
to approximately minimize (6). In particular, we apply variable splitting (Section 2) to the variable
$y$, to which the group-sparse regularizer $\tilde\Omega$ is applied (the original ALM-S splits both variables $x$
and $y$), and re-formulate (6) as follows.

$$ \min_{x, y, \bar y} \; \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \tilde\Omega(\bar y) \quad \text{s.t. } y = \bar y. \tag{12} $$

Note that the Lagrange multiplier $v$ is fixed here. Defining

$$ f(x, y) := \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2, \tag{13} $$

$$ g(y) = \tilde\Omega(y) = \lambda \sum_s \|y_s\|, \tag{14} $$

problem (12) is of the form

$$ \min \; f(x, y) + g(\bar y) \quad \text{s.t. } y = \bar y, \tag{15} $$

to which we now apply partial-linearization.

3.2.1 Partial linearization and convergence rate analysis 

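Let us define

$$ F(x, y) := f(x, y) + g(y) = \mathcal{L}(x, y; v), \tag{16} $$

$$ \mathcal{L}_\rho(x, y, \bar y, \gamma) := f(x, y) + g(\bar y) + \gamma^T(\bar y - y) + \frac{1}{2\rho}\|\bar y - y\|^2, \tag{17} $$

where $\gamma$ is the Lagrange multiplier in the augmented Lagrangian (17) corresponding to problem
(15), and $\gamma_g(\bar y)$ is a sub-gradient of $g$ at $\bar y$. We now present our partial-split alternating linearization
algorithm to implement ApproxAugLagMin(x, y, v) in Algorithm 2.2.

Algorithm 3.2 APLM-S

1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$, $\gamma^0$, such that $-\gamma^0 \in \partial g(\bar y^0)$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \cdots$ until stopping criterion is satisfied do
3:   $(x^{k+1}, y^{k+1}) \leftarrow \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar y^k, \gamma^k)$
4:   if $F(x^{k+1}, y^{k+1}) > \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar y^k, \gamma^k)$ then
5:     $y^{k+1} \leftarrow \bar y^k$
6:     $x^{k+1} \leftarrow \arg\min_x f(x, y^{k+1}) \equiv \arg\min_x \mathcal{L}_\rho(x; y^{k+1}, \bar y^k, \gamma^k)$
7:   end if
8:   $\bar y^{k+1} \leftarrow p_f(x^{k+1}, y^{k+1}) \equiv \arg\min_{\bar y} \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar y, \nabla_y f(x^{k+1}, y^{k+1}))$
9:   $\gamma^{k+1} \leftarrow \nabla_y f(x^{k+1}, y^{k+1}) - \frac{y^{k+1} - \bar y^{k+1}}{\rho}$
10: end for
11: return $(x^{K+1}, \bar y^{K+1})$

We note that in Line 6 in Algorithm 3.2,

$$ x^{k+1} = \arg\min_x \mathcal{L}_\rho(x; y^{k+1}, \bar y^k, \gamma^k) \equiv \arg\min_x f(x; y^{k+1}) \equiv \arg\min_x f(x; \bar y^k). \tag{18} $$

Now, we have a variant of Lemma 2.2 in [19].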



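Lemma 3.1. For any $(x, y)$, if $\bar q := \arg\min_{\bar y} \mathcal{L}_\rho(x, y, \bar y, \nabla_y f(x, y))$ and

$$ F(x, \bar q) \le \mathcal{L}_\rho(x, y, \bar q, \nabla_y f(x, y)), \tag{19} $$

then for any $(\bar x, \bar y)$,

$$ 2\rho\left(F(\bar x, \bar y) - F(x, \bar q)\right) \ge \|\bar q - \bar y\|^2 - \|y - \bar y\|^2 + 2\rho\left((\bar x - x)^T \nabla_x f(x, y)\right). \tag{20} $$

Similarly, for any $\bar y$, if $(p, q) := \arg\min_{x,y} \mathcal{L}_\rho(x, y, \bar y, -\gamma_g(\bar y))$ and

$$ F((p, q)) \le \mathcal{L}_\rho((p, q), \bar y, -\gamma_g(\bar y)), \tag{21} $$

then for any $(x, y)$,

$$ 2\rho\left(F(x, y) - F((p, q))\right) \ge \|(p, q)_y - y\|^2 - \|\bar y - y\|^2. \tag{22} $$

Proof. See Appendix A. □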



Algorithm 3.2 checks condition (21) at Line 4 because the function $g$ is non-smooth and condition
(21) may not hold no matter what the value of $\rho$ is. When this condition is violated, a skipping
step occurs in which the value of $y$ is set to the value of $\bar y$ in the previous iteration (Line 5) and $\mathcal{L}_\rho$
is re-minimized with respect to $x$ (Line 6) to ensure convergence. Let us define a regular iteration of
Algorithm 3.2 to be an iteration where no skipping step occurs, i.e. Lines 5 and 6 are not executed.
Likewise, we define a skipping iteration to be an iteration where a skipping step occurs. Now, we 
are ready to state the iteration complexity result for APLM-S. 

Theorem 3.1. Assume that $\nabla_y f(x, y)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$, i.e.
for any $x$, $\|\nabla_y f(x, y) - \nabla_y f(x, z)\| \le L_y(f)\|y - z\|$, for all $y$ and $z$. For $\rho \le \frac{1}{L_y(f)}$, the iterates
$(x^k, \bar y^k)$ in Algorithm 3.2 satisfy

$$ F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{\|\bar y^0 - y^*\|^2}{2\rho(k + k_n)}, \quad \forall k, \tag{23} $$

where $(x^*, y^*)$ is an optimal solution to (12), and $k_n$ is the number of regular iterations among the
first $k$ iterations.

Proof. See Appendix. □
Remark 3.1. For Theorem 3.1 to hold, we need $\rho \le \frac{1}{L_y(f)}$. From the definition of $f(x, y)$ in (13),
it is easy to see that $L_y(f) = \frac{1}{\mu}$ regardless of the loss function $L(x)$. Hence, we set $\rho = \mu$, so that
condition (19) in Lemma 3.1 is satisfied.
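The value of the Lipschitz constant can be checked in two lines. The derivation below uses only the definition of $f$ in (13); the loss term does not depend on $y$ and therefore drops out.

```latex
\nabla_y f(x, y) = v - \frac{1}{\mu}(Cx - y)
\quad\Longrightarrow\quad
\nabla_y f(x, y) - \nabla_y f(x, z) = \frac{1}{\mu}(y - z),
\qquad\text{so}\qquad
\|\nabla_y f(x, y) - \nabla_y f(x, z)\| = \frac{1}{\mu}\|y - z\| .
```

The bound holds with equality for every $x$, which gives $L_y(f) = \frac{1}{\mu}$ and makes $\rho = \mu$ the largest admissible step size.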



In Section 3.3, we will discuss the case where the iterations entirely consist of skipping steps.
We will show that this is equivalent to ISTA with partial linearization as well as a variant of
ADAL. In this case, the inner Lagrange multiplier $\gamma$ is redundant.



3.2.2 Solving the subproblems 



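We now show how to solve the subproblems in Algorithm 3.2. First, observe that since $\rho = \mu$,

$$ \arg\min_{\bar y} \mathcal{L}_\rho(x, y, \bar y, \nabla_y f(x, y)) \equiv \arg\min_{\bar y} \left\{ \nabla_y f(x, y)^T \bar y + \frac{1}{2\mu}\|\bar y - y\|^2 + g(\bar y) \right\} \tag{24} $$

$$ \equiv \arg\min_{\bar y} \left\{ \frac{1}{2\mu}\|d - \bar y\|^2 + \lambda \sum_s \|\bar y_s\| \right\}, \tag{25} $$

where $d = Cx - \mu v$. Hence, $\bar y$ can be obtained by applying the block soft-thresholding operator
$T(d_s, \mu\lambda)$ as in Section 3.1. Next consider the subproblem

$$ \min_{(x, y)} \mathcal{L}_\rho(x, y, \bar y, \gamma) \equiv \min_{(x, y)} \left\{ f(x, y) + \gamma^T(\bar y - y) + \frac{1}{2\mu}\|\bar y - y\|^2 \right\}. \tag{26} $$

It is easy to verify that solving the linear system given by the optimality conditions for (26) by
block Gaussian elimination yields the system

$$ \left(A^T A + \frac{1}{2\mu}D\right)x = r_x + \frac{1}{2}C^T r_y \tag{27} $$

for computing $x$, where $r_x = A^T b + C^T v$ and $r_y = -v + \gamma + \frac{\bar y}{\mu}$. Then $y$ can be computed as
$\frac{\mu}{2}\left(r_y + \frac{1}{\mu}Cx\right)$.

As in Section 3.1, only one Cholesky factorization of $A^T A + \frac{1}{2\mu}D$ is required for each invocation of
Algorithm 3.2. Hence, the amount of work involved in each iteration of Algorithm 3.2 is comparable
to that of an ADAL iteration.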
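The elimination step behind (27) can be spelled out as follows. This is a worked sketch that uses only $f$ from (13), the objective in (26) with $\rho = \mu$, and $D = C^T C$.

```latex
% Stationarity of (26) with respect to y:
v - \tfrac{1}{\mu}(Cx - y) - \gamma - \tfrac{1}{\mu}(\bar y - y) = 0
\;\Longrightarrow\;
y = \tfrac{\mu}{2}\Big(r_y + \tfrac{1}{\mu}Cx\Big),
\qquad r_y = -v + \gamma + \tfrac{\bar y}{\mu}.

% Stationarity of (26) with respect to x:
A^T(Ax - b) - C^T v + \tfrac{1}{\mu}C^T(Cx - y) = 0
\;\Longrightarrow\;
\Big(A^T A + \tfrac{1}{\mu}D\Big)x - \tfrac{1}{\mu}C^T y = r_x,
\qquad r_x = A^T b + C^T v.

% Substituting y, with (1/\mu) C^T y = (1/2) C^T r_y + (1/(2\mu)) D x:
\Big(A^T A + \tfrac{1}{2\mu}D\Big)x = r_x + \tfrac{1}{2}C^T r_y .
```

Eliminating $y$ halves the coefficient of $D$, which is why the matrix to be factorized differs from the one in (10).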



It is straightforward to derive an accelerated version of Algorithm 3.2, which we shall refer to as
FAPLM-S, that corresponds to a partial-split version of the FALM algorithm proposed in [19] and
also requires $O(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-optimal solution. In Section 3.4, we present an
algorithm FISTA-p, which is a special version of FAPLM-S in which every iteration is a skipping
iteration and which has a much simpler form than FAPLM-S, while having essentially the same
iteration complexity.

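It is also possible to apply ALM-S directly, which splits both $x$ and $y$, to solve the augmented
Lagrangian subproblem. Similar to (12), we reformulate (6) as

$$ \min_{(x, y), (\bar x, \bar y)} \; \frac{1}{2}\|Ax - b\|^2 - v^T(Cx - y) + \frac{1}{2\mu}\|Cx - y\|^2 + \lambda \sum_s \|\bar y_s\| \quad \text{s.t. } x = \bar x, \; y = \bar y. \tag{28} $$

The functions $f$ and $g$ are defined as in (13) and (14), except that now we write $g$ as $g(\bar x, \bar y)$ even
though the variable $\bar x$ does not appear in the expression for $g$. It can be shown that $\bar y$ admits exactly
the same expression as in APLM-S, whereas $\bar x$ is obtained by a gradient step, $x - \rho\nabla_x f(x, y)$. To
obtain $x$, we solve the linear system

$$ \left(A^T A + \frac{1}{\mu + \rho}D + \frac{1}{\rho}I\right)x = r_x + \frac{\rho}{\mu + \rho}C^T r_y, \tag{29} $$

after which $y$ is computed by $y = \frac{\mu\rho}{\mu + \rho}\left(r_y + \frac{1}{\mu}Cx\right)$, where $r_x$ and $r_y$ are defined analogously
to those in (27).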



Remark 3.2. For ALM-S, the Lipschitz constant for $\nabla f(x, y)$ is $L_f = \lambda_{max}(A^T A) + \frac{1}{\mu}d_{max}$, where
$d_{max} = \max_i D_{ii} \ge 1$. For the complexity results in [19] to hold, we need $\rho \le \frac{1}{L_f}$. Since $\lambda_{max}(A^T A)$
is usually not known, it is necessary to perform a backtracking line-search on $\rho$ to ensure that
$F(x^{k+1}, y^{k+1}) \le \mathcal{L}_\rho(x^{k+1}, y^{k+1}, \bar x^k, \bar y^k, \gamma^k)$. In practice, we adopted the following continuation
scheme instead. We initially set $\rho = \rho_0$ and decreased $\rho$ by a factor of $\beta$ after a given
number of iterations until $\rho$ reached a user-supplied minimum value $\rho_{min}$. This scheme prevents $\rho$
from being too small, and hence negatively impacting computational performance. However, in both
cases the left-hand-side of the system (29) has to be re-factorized every time $\rho$ is updated.



As we have seen above, the Lipschitz constant resulting from splitting both $x$ and $y$ is potentially
much larger than $\frac{1}{\mu}$. Hence, partial-linearization reduces the Lipschitz constant and hence improves
the bound on the right-hand-side of (23) and allows Algorithm 3.2 to take larger step sizes (equal
to $\rho$). Compared to ALM-S, solving for $x$ in the skipping step (Line 6) becomes harder. Intuitively,
APLM-S does a better job of 'load-balancing' by managing a better trade-off between the hardness
of the subproblems and the practical convergence rate.



3.3 ISTA: partial linearization (ISTA-p) 

We can also minimize the augmented Lagrangian (6), which we write as $\mathcal{L}(x, y, v) = f(x, y) + g(y)$
with $f(x, y)$ and $g(y)$ defined as in (13) and (14), using a variant of ISTA that only linearizes
$f(x, y)$ with respect to the $y$ variables. As in Section 3.2, we can set $\rho = \mu$ and guarantee the
convergence properties of ISTA-p (see Corollary 3.1 below). Formally, let $(x, y)$ be the current
iterate and $(x^+, y^+)$ be the next iterate. We compute $y^+$ by

$$ y^+ = \arg\min_{y'} \mathcal{L}_\rho(x, y, y', \nabla_y f(x, y)) \tag{30} $$

$$ \equiv \arg\min_{y'} \left\{ \frac{1}{2\mu}\|y' - d_y\|^2 + \lambda \sum_j \|y'_j\| \right\}, \tag{31} $$

where $d_y = Cx - \mu v$. Hence the solution $y^+$ to problem (31) is given blockwise by $T([d_y]_j, \mu\lambda)$,
$j = 1, \cdots, J$.
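The blockwise operator $T$ can be sketched in a few lines. The function name `block_soft_threshold` and the representation of groups as Python slices are illustrative choices, not part of the paper.

```python
import numpy as np

def block_soft_threshold(d, blocks, thresh):
    """Apply T(d_j, thresh) = d_j / ||d_j|| * max(0, ||d_j|| - thresh) to each block."""
    y = np.zeros_like(d)
    for blk in blocks:
        norm = np.linalg.norm(d[blk])
        if norm > thresh:                  # otherwise the whole block is set to zero
            y[blk] = (1.0 - thresh / norm) * d[blk]
    return y

d = np.array([3.0, 4.0, 0.3, 0.4])
blocks = [slice(0, 2), slice(2, 4)]
y_plus = block_soft_threshold(d, blocks, thresh=1.0)
```

In this example the first block has norm 5 and is shrunk by the factor 0.8, while the second block has norm 0.5 and is zeroed out entirely, which is the group-sparsity effect of the $l_1/l_2$ penalty.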

Now given $y^+$, we solve for $x^+$ by

$$ x^+ = \arg\min_{x'} f(x', y^+) \equiv \arg\min_{x'} \left\{ \frac{1}{2}\|Ax' - b\|^2 - v^T(Cx' - y^+) + \frac{1}{2\mu}\|Cx' - y^+\|^2 \right\}. \tag{32} $$

The algorithm that implements subroutine ApproxAugLagMin(x, y, v) in Algorithm 2.2 by ISTA
with partial linearization is stated below as Algorithm 3.3.

Algorithm 3.3 ISTA-p (partial linearization)

1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \cdots$ until stopping criterion is satisfied do
3:   $x^{k+1} \leftarrow \arg\min_x f(x; \bar y^k)$
4:   $\bar y^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, \bar y^k, y, \nabla_y f(x^{k+1}, \bar y^k))$
5: end for
6: return $(x^{K+1}, \bar y^{K+1})$

As we remarked in Section 3.2, Algorithm 3.3 is equivalent to Algorithm 3.2 (APLM-S) where
every iteration is a skipping iteration. Hence, we have from Theorem 3.1

Corollary 3.1. Assume $\nabla_y f(\cdot, \cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$. For $\rho \le \frac{1}{L_y(f)}$,
the iterates $(x^k, \bar y^k)$ in Algorithm 3.3 satisfy

$$ F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{\|\bar y^0 - y^*\|^2}{2\rho k}, \quad \forall k, \tag{33} $$

where $(x^*, y^*)$ is an optimal solution to (12).

It is easy to see that (31) is equivalent to (8), and that (32) is the same as (9) in ADAL.



Remark 3.3. We have shown that with a fixed v, the ISTA-p iterations are exactly the same as 
the ADAL iterations. The difference between the two algorithms is that ADAL updates the (outer) 
Lagrange multiplier v in each iteration, while in ISTA-p, v stays the same throughout the inner 
iterations. We can thus view ISTA-p as a variant of ADAL with delayed updating of the Lagrange 
multiplier. 



The 'load-balancing' behavior discussed in Section 3.2 is more obvious for ISTA-p. As we will
see in Section 3.5, if we apply ISTA (with full linearization) to minimize (6), solving for $x$ is simply
a gradient step. Here, we need to minimize f{x, y) with respect to x exactly, while being able to 
take larger step sizes in the other subproblem, due to the smaller associated Lipschitz constant. 



3.4 FISTA-p 

We now present an accelerated version FISTA-p of ISTA-p. FISTA-p is a special case of FAPLM-S
with a skipping step occurring in every iteration. We state the algorithm formally as Algorithm 3.4.
The iteration complexity of FISTA-p (and FAPLM-S) is given by the following theorem.

Algorithm 3.4 FISTA-p (partial linearization)

1: Given $x^0$, $\bar y^0$, $v$. Choose $\rho$, and $z^0 = \bar y^0$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \cdots, K$ do
3:   $x^{k+1} \leftarrow \arg\min_x f(x; z^k)$
4:   $\bar y^{k+1} \leftarrow \arg\min_y \mathcal{L}_\rho(x^{k+1}, z^k, y, \nabla_y f(x^{k+1}, z^k))$
5:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
6:   $z^{k+1} \leftarrow \bar y^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar y^{k+1} - \bar y^k)$
7: end for
8: return $(x^{K+1}, \bar y^{K+1})$

Theorem 3.2. Assuming that $\nabla_y f(\cdot)$ is Lipschitz continuous with Lipschitz constant $L_y(f)$ and
$\rho \le \frac{1}{L_y(f)}$, the sequence $\{x^k, \bar y^k\}$ generated by Algorithm 3.4 satisfies

$$ F(x^k, \bar y^k) - F(x^*, y^*) \le \frac{2\|\bar y^0 - y^*\|^2}{\rho(k + 1)^2}. \tag{34} $$



Although we need to solve a linear system in every iteration of Algorithms 3.2, 3.3, and 3.4, the
left-hand-side of the system stays constant throughout the invocation of the algorithms because,
following Remark 3.1, we can always set $\rho = \mu$. Hence, no line-search is necessary, and this step
essentially requires only one backward- and one forward-substitution, the complexity of which is 
the same as a gradient step. 
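A minimal sketch of Algorithm 3.4 with $\rho = \mu$ is given below. The function name `fista_p`, the dense Cholesky factor, the fixed iteration count, and the starting value $t_0 = 1$ (the usual FISTA choice) are assumptions made for illustration; groups are passed as slices into $y$.

```python
import numpy as np

def fista_p(A, b, C, blocks, v, lam, mu, iters=500):
    """Sketch of FISTA-p for a fixed multiplier v, with step size rho = mu."""
    # The left-hand side A^T A + D/mu is factorized once and reused.
    chol = np.linalg.cholesky(A.T @ A + (C.T @ C) / mu)
    base = A.T @ b + C.T @ v
    y_bar = np.zeros(C.shape[0])
    z, t = y_bar.copy(), 1.0                  # t_0 = 1 is an assumption
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        # Line 3: exact minimization of f(x; z) over x.
        rhs = base + (C.T @ z) / mu
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        # Line 4: block soft-thresholding of z - mu * grad_y f(x, z) = Cx - mu*v.
        d = C @ x - mu * v
        y_new = np.zeros_like(d)
        for blk in blocks:
            norm = np.linalg.norm(d[blk])
            if norm > mu * lam:
                y_new[blk] = (1.0 - mu * lam / norm) * d[blk]
        # Lines 5-6: momentum update.
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = y_new + ((t - 1.0) / t_new) * (y_new - y_bar)
        y_bar, t = y_new, t_new
    return x, y_bar

# Check on a case with a closed form: A = C = I, v = 0, singleton groups.
# Eliminating x leaves min_y ||y - b||^2 / (2(1 + mu)) + lam * ||y||_1, whose
# solution is soft-thresholding of b at level lam * (1 + mu).
b = np.array([3.0, -0.5, 5.0])
blocks = [slice(0, 1), slice(1, 2), slice(2, 3)]
x_hat, y_hat = fista_p(np.eye(3), b, np.eye(3), blocks, np.zeros(3), lam=1.0, mu=1.0)
```

Because `v` is held fixed, the routine solves only the inner problem; the outer multiplier update of Algorithm 2.2 would wrap around it.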



3.5 ISTA/FISTA: full linearization 

ISTA solves the following problem in each iteration to produce the next iterate $(x^+, y^+)$:

$$ \min_{x', y'} \; \frac{1}{2\rho}\left\| \begin{pmatrix} x' \\ y' \end{pmatrix} - d \right\|^2 + \lambda \sum_s \|y'_s\|, \tag{35} $$

where $d = \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} - \rho\nabla f(x, y)$, and $f(x, y)$ is defined in (13). It is easy to see that we
can solve for $x^+$ and $y^+$ separately in (35). Specifically,

$$ x^+ = d_x, \tag{36} $$

$$ y_j^+ = \frac{d_{y_j}}{\|d_{y_j}\|}\max(0, \|d_{y_j}\| - \lambda\rho), \quad j = 1, \ldots, J. \tag{37} $$

Using ISTA to solve the outer augmented Lagrangian (6) subproblem is equivalent to taking only
skipping steps in ALM-S. In our experiments, we used the accelerated version of ISTA, i.e. FISTA
(Algorithm 3.5), to solve (6).

FISTA (resp. ISTA) is, in fact, an inexact version of FISTA-p (resp. ISTA-p), where we
minimize with respect to $x$ a linearized approximation

$$ \tilde f(x, z^k) := f(x^k, z^k) + \nabla_x f(x^k, z^k)^T(x - x^k) + \frac{1}{2\rho}\|x - x^k\|^2 $$

of the quadratic objective function $f(x, z^k)$ in (32). The update to $x$ in Line 3 of Algorithm 3.4 is
replaced by (36) as a result. Similar to FISTA-p, FISTA is also a special skipping version of the
full-split FALM-S. Considering that FISTA has an iteration complexity of $O(1/k^2)$, it is not surprising
that FISTA-p has the same iteration complexity.

Remark 3.4. Since FISTA requires only the gradient of $f(x, y)$, it can easily handle any smooth
convex loss function, such as the logistic loss for binary classification, $L(x) = \sum_{i=1}^{N} \log(1 +
\exp(-b_i a_i^T x))$, where $a_i^T$ is the $i$-th row of $A$, and $b$ is the vector of labels. Moreover, when the
scale of the data ($\min\{n, m\}$) is so large that it is impractical to compute the Cholesky factorization
of $A^T A$, FISTA is a good choice to serve as the subroutine ApproxAugLagMin(x, y, v) in
OGLasso-AugLag.
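To make the remark concrete, the logistic loss and its gradient can be evaluated as in the sketch below; FISTA would use this gradient in place of $A^T(Ax - b)$ when forming $\nabla_x f$. The function name `logistic_loss_and_grad` and the random test data are hypothetical, and labels are assumed to take values in $\{-1, +1\}$.

```python
import numpy as np

def logistic_loss_and_grad(x, A, b):
    """Return L(x) = sum_i log(1 + exp(-b_i a_i^T x)) and its gradient."""
    margins = b * (A @ x)
    loss = np.sum(np.logaddexp(0.0, -margins))        # stable log(1 + exp(-margin))
    weights = np.exp(-np.logaddexp(0.0, margins))     # equals 1 / (1 + exp(margin))
    grad = -A.T @ (b * weights)
    return loss, grad

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
x0 = rng.standard_normal(3)
loss0, grad0 = logistic_loss_and_grad(x0, A, b)
```

Using `np.logaddexp` avoids overflow for large margins, which matters when the iterates are far from the solution.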
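Algorithm 3.5 FISTA

1: Given $\bar x^0$, $\bar y^0$, $v$. Choose $\rho_0$. Set $t_0 = 0$, $z_x^0 = \bar x^0$, $z_y^0 = \bar y^0$. Define $f(x, y)$ as in (13).
2: for $k = 0, 1, \cdots$ until stopping criterion is satisfied do
3:   Perform a backtracking line-search on $\rho$, starting from $\rho_0$.
4:   $\begin{pmatrix} d_x \\ d_y \end{pmatrix} \leftarrow \begin{pmatrix} z_x^k \\ z_y^k \end{pmatrix} - \rho\nabla f(z_x^k, z_y^k)$
5:   $\bar x^{k+1} \leftarrow d_x$
6:   $\bar y_j^{k+1} \leftarrow \frac{d_{y_j}}{\|d_{y_j}\|}\max(0, \|d_{y_j}\| - \lambda\rho)$, $j = 1, \ldots, J$
7:   $t_{k+1} \leftarrow \frac{1 + \sqrt{1 + 4t_k^2}}{2}$
8:   $z_x^{k+1} \leftarrow \bar x^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar x^{k+1} - \bar x^k)$
9:   $z_y^{k+1} \leftarrow \bar y^{k+1} + \frac{t_k - 1}{t_{k+1}}(\bar y^{k+1} - \bar y^k)$
10: end for
11: return $(\bar x^{K+1}, \bar y^{K+1})$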



4 Overlapping Group $l_1/l_\infty$-Regularization

The subproblems with respect to $y$ (or $\bar y$) involved in all the algorithms presented in the previous
sections take the following form

$$ \min_y \; \frac{1}{2\mu}\|c - y\|^2 + \tilde\Omega(y), \tag{38} $$

where $\tilde\Omega(y) = \lambda \sum_{s \in S} w_s\|y_s\|_\infty$ in the case of $l_1/l_\infty$-regularization. In (8), for example, $c = Cx - \mu v$.
The solution to (38) is the proximal operator of $\tilde\Omega$ [TH, 12]. Similar to the classical Group Lasso,
this problem is block-separable and hence all blocks can be solved simultaneously.

Again, for notational simplicity, we assume $w_s = 1$ $\forall s \in S$ and omit it from now on. For each
$s \in S$, the subproblem in (38) is of the form

$$ \min_{y_s} \; \frac{1}{2}\|c_s - y_s\|^2 + \mu\lambda\|y_s\|_\infty. \tag{39} $$

As shown in [38], the optimal solution to the above problem is $c_s - P(c_s)$, where $P$ denotes the
orthogonal projector onto the ball of radius $\mu\lambda$ in the dual norm of the $l_\infty$-norm, i.e. the $l_1$-norm.
The Euclidean projection onto the simplex can be computed in (expected) linear time [14, 9]. Duchi
et al. [Uj show that the problem of computing the Euclidean projection onto the $l_1$-ball can be
reduced to that of finding the Euclidean projection onto the simplex in the following way. First,
we replace $c_s$ in problem (39) by $|c_s|$, where the absolute value is taken component-wise. After
we obtain the projection $z_s$ onto the simplex, we can construct the projection onto the $l_1$-ball by
setting $y_s^* = \mathrm{sign}(c_s) z_s$, where $\mathrm{sign}(\cdot)$ is also taken component-wise.
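The reduction described above can be sketched as follows for a single block. For simplicity the simplex projection below uses the well-known sorting-based method, which costs $O(n \log n)$ rather than the expected linear time of the algorithms cited in the text; all function names are hypothetical.

```python
import numpy as np

def project_simplex(u, radius):
    """Project u onto {z >= 0, sum(z) = radius} using the sorting-based method."""
    s = np.sort(u)[::-1]
    css = np.cumsum(s)
    k = np.arange(1, len(u) + 1)
    rho = k[s - (css - radius) / k > 0][-1]   # number of positive entries kept
    theta = (css[rho - 1] - radius) / rho
    return np.maximum(u - theta, 0.0)

def project_l1_ball(c, radius):
    """Euclidean projection of c onto the l1-ball of the given radius."""
    if np.sum(np.abs(c)) <= radius:
        return c.copy()                       # already inside the ball
    z = project_simplex(np.abs(c), radius)    # project |c| onto the simplex
    return np.sign(c) * z                     # restore the signs

def prox_linf(c, thresh):
    """Solve min_y 0.5*||c - y||^2 + thresh*||y||_inf as c - P(c)."""
    return c - project_l1_ball(c, thresh)

c_s = np.array([3.0, 1.0, -2.0])
y_s = prox_linf(c_s, thresh=2.0)
```

For this input the result clips the entries of `c_s` to magnitude 1.5, the level at which the total amount removed equals the threshold 2.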

5 Experiments 



We tested the OGLasso-AugLag framework (Algorithm 2.2) with four subroutines: ADAL, APLM-S,
FISTA-p, and FISTA. We implemented the framework with the first three subroutines in C++
to compare them with the ProxFlow algorithm proposed in [30]. We used the C interface and
BLAS and LAPACK subroutines provided by the AMD Core Math Library (ACML)^. To compare
with ProxGrad, we implemented the framework and all four algorithms in Matlab. We did
not include ALM-S in our experiments because it is time-consuming to find the right $\rho$ for the
inner loops as discussed in Remark 3.2, and our preliminary computational experience showed that
ALM-S was slower than the other algorithms, even when the heuristic $\rho$-setting scheme discussed
in Remark 3.2 was used, because a large number of steps were skipping steps, which meant that the
computation involved in solving the linear systems in those steps was wasted. All of our experiments
were performed on a laptop PC with an Intel Core 2 Duo 2.0 GHz processor and 4 GB of memory.

^ http://developer.amd.com/libraries/acml/pages/default.aspx. Ideally, we should have used the Intel Math Kernel
Library (Intel MKL), which is optimized for Intel processors, but Intel MKL is not freely available.

Algorithm | Outer rel. dual residual $s^{l+1}$ | Inner iteration: rel. primal residual | Inner iteration: rel. objective gradient residual
ADAL | $\|C^T(y^{l+1} - y^l)\| / \|C^T y^l\|$ | - | -
FISTA-p | $\|C^T(\bar y^{K+1} - z^K)\| / \ldots$ | $\|\bar y^{k+1} - z^k\| / \ldots$ | …
APLM-S | $\|C^T(\bar y^{K+1} - y^{K+1})\| / \ldots$ | $\|\bar y^{k+1} - y^{k+1}\| / \|y^{k+1}\|$ | $\|C^T(\bar y^{k+1} - y^{k+1})\| / \|C^T y^{k+1}\|$
FISTA | … | … | …

Table 1: Specification of the quantities used in the outer and inner stopping criteria.

5.1 Algorithm parameters and termination criteria 

Each algorithm (framework + subroutine)^ required several parameters to be set and termination
criteria to be specified. We used stopping criteria based on the primal and dual residuals as in [8].
We specify the criteria for each of the algorithms below, but defer their derivation to Appendix C.
The maximum number of outer iterations was set to 500, and the tolerance for the outer loop was 
set at eout = 10~^. The number of inner- iterations was capped at 2000, and the tolerance at the 
l-th outer iteration for the inner loops was e^„. Our termination criterion for the outer iterations 
was 

max{r^,s^} < eout, (40) 
where r = 'L^ .u is the outer relative primal residual and s is the relative dual residual, 

max|||Ca;'||,|jy'||| ^ ' 

which is given for each algorithm in Table [l] Recall that ET-l-l is the index of the last inner iteration 
of the l-th. outer iteration; for example, for APLM-S, (x'"*"^, y'"''^) takes the value of the last inner 
iterate (x^"*"^, y^"^^). We stopped the inner iterations when the maximum of the relative primal 
residual and the relative objective gradient for the inner problem was less than e^„. (See Table [l] 
for the expressions of these two quantities.) We see there that s'"*"^ can be obtained directly from 
the relative gradient residual computed in the last inner iteration of the Z-th outer iteration. 

We set fiQ = 0.01 in all algorithms except that we set po = 0.1 in ADAL for the data sets other 
than the first synthetic set and the breast cancer data set. We set p = fi in FISTA-p and APLM-S 
and po = p in FISTA. 
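The outer termination test (40) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the product $Cx$ is assumed to be precomputed and passed in as a list, and the handling of a zero denominator is a convention we chose, not something specified in the text.

```python
import math

def norm2(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(t * t for t in v))

def relative_primal_residual(Cx, y):
    """r = ||Cx - y|| / max(||Cx||, ||y||); Cx is the precomputed product C @ x."""
    diff = [a - b for a, b in zip(Cx, y)]
    denom = max(norm2(Cx), norm2(y))
    # Returning 0 when both vectors vanish is our own convention (assumption).
    return norm2(diff) / denom if denom > 0 else 0.0

def outer_converged(r, s, eps_out):
    """Outer termination test (40): stop once max{r, s} < eps_out."""
    return max(r, s) < eps_out
```

In an outer loop, one would evaluate `relative_primal_residual` on the latest iterate, take the relative dual residual from the last inner iteration, and stop when `outer_converged` returns `True` or the cap on outer iterations is reached.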



For Theorem 2.1 to hold, the solution returned by the function ApproxAugLagMin(x, y, v) has to become increasingly more accurate over the outer iterations. However, it is not possible to evaluate the sub-optimality quantity $\alpha^l$ in (7) exactly because the optimal value of the augmented Lagrangian $\mathcal{L}(x, y, v^l)$ is not known in advance. In our experiments, we used the maximum of the relative primal and dual residuals ($\max\{r^l, s^l\}$) as a surrogate for $\alpha^l$ for two reasons: First, it has been shown in [8] that $r^l$ and $s^l$ are closely related to $\alpha^l$. Second, the quantities $r^l$ and $s^l$ are readily available as by-products of the inner and outer iterations. To ensure that the sequence $\{\epsilon_{in}^l\}$ satisfies (7), we basically set

$$\epsilon_{in}^{l+1} = \beta_{in}\,\epsilon_{in}^l, \qquad (41)$$

with $\epsilon_{in}^0 = 0.01$ and $\beta_{in} = 0.5$. However, since we terminate the outer iterations at $\epsilon_{out} > 0$, it is not necessary to solve the subproblems to an accuracy much higher than the one for the outer loop. On the other hand, it is also important for $\epsilon_{in}^l$ to decrease to below $\epsilon_{out}$, since $s^l$ is closely related to the quantities involved in the inner stopping criteria. Hence, we slightly modified (41) and used

$$\epsilon_{in}^{l+1} = \max\{\beta_{in}\,\epsilon_{in}^l,\ 0.2\,\epsilon_{out}\}.$$

Footnote: For conciseness, we use the subroutine names (e.g. FISTA-p) to represent the full algorithms that consist of the OGLasso-AugLag framework and the subroutines.
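As a small illustration of the floored geometric rule for the inner tolerance, the sketch below generates the first few values of $\epsilon_{in}^l$. The function name and the argument `n_outer` are our own; `eps_out` is left as a required argument, and the value `1e-4` used in the usage note is only a placeholder.

```python
def inner_tolerances(eps_out, n_outer, eps_in0=0.01, beta_in=0.5):
    """Return [eps_in^0, ..., eps_in^(n_outer-1)] using the floored rule.

    Each step multiplies the previous tolerance by beta_in, but never lets
    it drop below 0.2 * eps_out.
    """
    tols = [eps_in0]
    for _ in range(n_outer - 1):
        tols.append(max(beta_in * tols[-1], 0.2 * eps_out))
    return tols
```

For example, `inner_tolerances(1e-4, 12)` starts at 0.01, halves at each step, and then stays at the floor `0.2 * 1e-4` once the geometric sequence would fall below it.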

Recently, we became aware of an alternative 'relative error' stopping criterion [16] for the inner loops, which guarantees convergence of Algorithm 2.2. In our context, this criterion essentially requires that the absolute dual residual is less than a fraction of the absolute primal residual. For FISTA-p, for instance, the condition on the $(l+1)$-th iterate is expressed through the numerators in the expressions for $r$ and $s$, a constant, and an auxiliary variable $w_y$ that is updated in each outer iteration. We experimented with this criterion but did not find any computational advantage over the heuristic based on the relative primal and dual residuals.



5.2 Strategies for updating $\mu$

The penalty parameter $\mu$ in the outer augmented Lagrangian (9) not only controls the infeasibility in the constraint $Cx = y$, but also serves as the step-length in the $y$-subproblem (and the $x$-subproblem in the case of FISTA). We adopted two kinds of strategies for updating $\mu$. The first one simply kept $\mu$ fixed. In this case, choosing an appropriate $\mu_0$ was important for good performance. This was especially true for ADAL in our computational experiments. Usually, a $\mu_0$ in the range of 10~^ to 10"'^ worked well.

The second strategy is a dynamic scheme based on the values $r^l$ and $s^l$ [8]. Since $\mu$ penalizes the primal infeasibility, a small $\mu$ tends to result in a small primal residual. On the other hand, a large $\mu$ tends to yield a small dual residual. Hence, to keep $r^l$ and $s^l$ approximately balanced in each outer iteration, our scheme updated $\mu$ as follows:

$$\mu^{l+1} = \begin{cases} \max\{\beta\mu^l, \mu_{min}\}, & \text{if } r^l > \tau s^l, \\ \min\{\mu^l/\beta, \mu_{max}\}, & \text{if } s^l > \tau r^l, \\ \mu^l, & \text{otherwise,} \end{cases} \qquad (42)$$

where we set $\mu_{max} = 10$, $\mu_{min}$ = 10~^, $\tau = 10$ and $\beta = 0.5$, except for the first synthetic data set, where we set $\beta = 0.1$ for ADAL, FISTA-p, and APLM-S.
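The residual-balancing rule (42) is short enough to sketch directly. The function below is a hypothetical helper, not the authors' code; `mu_min` is deliberately a required argument, and the value `1e-6` in the usage note is only a placeholder.

```python
def update_mu(mu, r, s, mu_min, mu_max=10.0, tau=10.0, beta=0.5):
    """One application of the dynamic rule (42) for the penalty parameter.

    r and s are the current relative primal and dual residuals.
    """
    if r > tau * s:
        # Primal residual dominates: shrink mu (stronger penalty on Cx = y).
        return max(beta * mu, mu_min)
    if s > tau * r:
        # Dual residual dominates: enlarge mu.
        return min(mu / beta, mu_max)
    # Residuals are balanced within a factor tau: keep mu unchanged.
    return mu
```

For instance, `update_mu(0.01, r=1.0, s=0.01, mu_min=1e-6)` halves $\mu$ to 0.005, while swapping the two residuals doubles it to 0.02.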



5.3 Synthetic examples 

To compare our algorithms with the ProxGrad algorithm proposed in [11], we first tested a synthetic data set (ogl) using the procedure reported in [11] and [24]. The decision variables $x$ were arranged in groups of ten, with adjacent groups having an overlap of three variables. The support of $x$ was set to the first half of the variables. Each entry in the design matrix $A$ and the non-zero entries of $x$ were sampled from i.i.d. standard Gaussian distributions, and the output $b$ was set to $b = Ax + \epsilon$, where the noise $\epsilon \sim \mathcal{N}(0, I)$. Two sets of data were generated as follows: (a) Fix $n = 5000$ and vary the number of groups $J$ from 100 to 1000 with increments of 100. (b) Fix $J = 200$ and vary $n$ from 1000 to 10000 with increments of 1000. The stopping criterion for ProxGrad was the same as the one used for FISTA, and we set its smoothing parameter to 10~^. Figure 1 plots the CPU times taken by the Matlab version of our algorithms and ProxGrad (also in Matlab) on these scalability tests on $l_1/l_2$-regularization. A subset of the numerical results on which these plots are based is presented in Tables 4 and 5.
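The generation procedure for the ogl set can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the groups form a chain with stride 7, so the number of variables is inferred as $m = 7J + 3$; the function name, the seed handling, and the return layout are hypothetical and not taken from the original experiments.

```python
import numpy as np

def make_ogl_data(n, J, group_size=10, overlap=3, seed=0):
    """Sketch of an 'ogl'-style data set: chained overlapping groups, Gaussian A, x, noise."""
    rng = np.random.default_rng(seed)
    stride = group_size - overlap            # 7 new variables per additional group
    m = J * stride + overlap                 # total number of variables (inferred)
    groups = [np.arange(j * stride, j * stride + group_size) for j in range(J)]
    x = np.zeros(m)
    support = m // 2                         # support is the first half of the variables
    x[:support] = rng.standard_normal(support)
    A = rng.standard_normal((n, m))          # i.i.d. standard Gaussian design matrix
    b = A @ x + rng.standard_normal(n)       # b = A x + noise, noise ~ N(0, I)
    return A, b, x, groups
```

Calling `make_ogl_data(5000, 100)` would then give a problem of the same shape as the smallest instance in data set (a), although the random draws will of course differ from the ones used in the paper.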

The plots clearly show that the alternating direction methods were much faster than ProxGrad on these two data sets. Compared to ADAL, FISTA-p performed slightly better, while it showed an obvious computational advantage over its general version APLM-S. In the plot on the left of Figure 1, FISTA exhibited the advantage of a gradient-based algorithm when both $n$ and $m$ are large. In that case (towards the right end of the plot), the Cholesky factorizations required by ADAL, APLM-S, and FISTA-p became relatively expensive. When $\min\{n, m\}$ is small or the linear systems can be solved cheaply, as the plot on the right shows, FISTA-p and ADAL have an edge over FISTA due to the smaller numbers of inner iterations required.

We generated a second data set (dct) using the approach from [30] for scalability tests on both the $l_1/l_2$ and $l_1/l_\infty$ group penalties. The design matrix $A$ was formed from over-complete dictionaries of discrete cosine transforms (DCT). The set of groups consisted of all the contiguous sequences of length five in one-dimensional space. $x$ had about 10% non-zero entries, selected randomly. We generated the output as $b = Ax + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.01\|Ax\|^2)$. We fixed $n = 1000$ and varied the number of features $m$ from 5000 to 30000 with increments of 5000. This set of data leads to considerably harder problems than the previous set because the groups are heavily overlapping, and the DCT dictionary-based design matrix exhibits local correlations. Due to the excessive running time required on Matlab, we ran the C++ version of our algorithms for this data set, leaving out APLM-S and ProxGrad, whose performance compared to the other algorithms is already fairly clear from Figure 1. For ProxFlow, we set the tolerance on the relative duality gap to the same value as $\epsilon_{out}$, and kept all the other parameters at their default values.
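To make the phrase "heavily overlapping" concrete, the sketch below enumerates all contiguous windows of length five over $m$ features. It assumes the windows do not wrap around at the boundaries, which the text does not state; the function name is ours.

```python
def contiguous_groups(m, length=5):
    """All contiguous index windows of the given length over features 0..m-1.

    Assumes no wrap-around at the ends (an assumption of this sketch).
    """
    return [list(range(start, start + length)) for start in range(m - length + 1)]
```

With this construction every interior feature belongs to five different groups, so the number of groups is almost equal to the number of features.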

Figure 2 presents the CPU times required by the algorithms versus the number of features. In the case of $l_1/l_2$-regularization, it is clear that FISTA-p outperformed the other two algorithms. For $l_1/l_\infty$-regularization, ADAL and FISTA-p performed equally well and compared favorably to ProxFlow. In both cases, the growth of the CPU times for FISTA follows the same trend as that for FISTA-p, and they required a similar number of outer iterations, as shown in Tables 6 and 7. However, FISTA lagged behind in speed due to larger numbers of inner iterations. Unlike in the case of the ogl data set, Cholesky factorization was not a bottleneck for FISTA-p and ADAL here because we needed to compute it only once.

To simulate the situation where computing or caching $A^TA$ and its Cholesky factorization is not feasible, we switched ADAL and FISTA-p to PCG mode by always using PCG to solve the linear systems in the subproblems. We compared the performance of ADAL, FISTA-p, and FISTA on the previous data set for both the $l_1/l_2$ and $l_1/l_\infty$ models. The results for ProxFlow are copied from Figure 2 and Table 9 to serve as a reference. We experimented with the fixed-value and the dynamic updating schemes for $\mu$ on all three algorithms. From Figure 3, it is clear that the performance of FISTA-p was significantly improved by using the dynamic scheme. For ADAL, however, the dynamic scheme worked well only in the $l_1/l_2$ case, whereas the performance turned worse in general in the $l_1/l_\infty$ case. We did not include the results for FISTA with the dynamic scheme because the solutions obtained were considerably more suboptimal than the ones obtained with the fixed-$\mu$ scheme. Tables 8 and 9 report the best results of the algorithms in each case. The plots and numerical results show that FISTA-p compares favorably to ADAL and stays competitive with ProxFlow. In terms of the quality of the solutions, FISTA-p and ADAL also did a better job than FISTA, as evidenced in Table 9. On the other hand, the gap in CPU time between FISTA and the other three algorithms is less obvious.

Figure 1: Scalability test results of the algorithms on the synthetic overlapping Group Lasso data sets from [11]. The scale of the y-axis is logarithmic. The dynamic scheme for $\mu$ was used for all algorithms except ProxGrad.
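The PCG mode mentioned above avoids forming or factorizing $A^TA$. The following sketch shows one way this could look for a system of the form $(A^TA + \frac{1}{\mu}D)x = r$ with diagonal $D$. The system form, the Jacobi (diagonal) preconditioner, the tolerance, and all function names are assumptions of this sketch; the text does not say which preconditioner was actually used.

```python
import numpy as np

def pcg(matvec, rhs, diag_precond, tol=1e-10, max_iter=500):
    """Jacobi-preconditioned conjugate gradient for a symmetric positive definite system."""
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    z = r / diag_precond
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        # Stop on a small relative residual.
        if np.linalg.norm(r) <= tol * np.linalg.norm(rhs):
            break
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = r / diag_precond
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def solve_normal_system(A, d, mu, rhs):
    """Solve (A^T A + diag(d)/mu) x = rhs using only products with A and A^T."""
    matvec = lambda v: A.T @ (A @ v) + (d / mu) * v
    diag = np.einsum("ij,ij->j", A, A) + d / mu   # diagonal of A^T A + D/mu
    return pcg(matvec, rhs, diag)
```

Each iteration costs two matrix-vector products with $A$, so no $m \times m$ matrix is ever stored; the trade-off is that the number of iterations depends on the conditioning of the system.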

5.4 Real-world Examples

To demonstrate the practical usefulness of our algorithms, we tested them on two real-world applications.

5.4.1 Breast Cancer Gene Expressions 

We used the breast cancer data set [17] with canonical pathways from MSigDB [44]. The data was collected from 295 breast cancer tumor samples and contains gene expression measurements for 8,141 genes. The goal was to select a small set of the most relevant genes that yield the best prediction performance. A detailed description of the data set can be found in [11]. In our experiment, we performed a regression task to predict the length of survival of the patients. The canonical pathways naturally provide grouping information of the genes. Hence, we used them as the groups for the group-structured regularization term $\Omega(\cdot)$.

Table 2 summarizes the data attributes. The numerical results for the $l_1/l_2$-norm are collected in Table 10, which shows that FISTA-p and ADAL were the fastest on this data set. Again, we had to tune ADAL with different initial values ($\mu_0$) and updating schemes of $\mu$ for speed and quality of the solution, and we eventually kept $\mu$ constant at 0.01. The dynamic updating scheme for $\mu$ also did not work for FISTA, which returned a very suboptimal solution in this case. We instead adopted a simple scheme of decreasing $\mu$ by half every 10 outer iterations. Figure 6 graphically depicts the performance of the different algorithms. In terms of the outer iterations, APLM-S behaved identically to FISTA-p, and FISTA also behaved similarly to ADAL. However, APLM-S and FISTA were considerably slower due to larger numbers of inner iterations.

Figure 2: Scalability test results on the DCT set with $l_1/l_2$-regularization (left column) and $l_1/l_\infty$-regularization (right column). The scale of the y-axis is logarithmic. All of FISTA-p, FISTA, and ADAL were run with a fixed $\mu = \mu_0$.

Data sets          N (no. samples)   J (no. groups)   group size   average frequency
BreastCancerData   295               637              23.7 (avg)   4

Table 2: The Breast Cancer Dataset

We plot the root-mean-squared-error (RMSE) over different values of $\lambda$ (which lead to different numbers of active genes) in the left half of Figure 4. The training set consists of 200 randomly selected samples, and the RMSE was computed on the remaining 95 samples. $l_1/l_2$-regularization achieved lower RMSE in this case. However, $l_1/l_\infty$-regularization yielded better group sparsity, as shown in Figure 5. The sets of active genes selected by the two models were very similar, as illustrated in the right half of Figure 4. In general, the magnitudes of the coefficients returned by $l_1/l_\infty$-regularization tended to be similar within a group, whereas those returned by $l_1/l_2$-regularization did not follow that pattern. This is because $l_1/l_\infty$-regularization penalizes only the maximum element, rather than all the coefficients in a group, resulting in many coefficients having the same magnitudes.
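The effect described above can be seen on a toy example. The two helpers below compute unweighted versions of the two penalties (group weights are omitted, which is a simplification of this sketch), and the function names and the example vectors are ours.

```python
import math

def l1_l2_penalty(x, groups):
    """Sum over groups of the Euclidean norm of the group's coefficients."""
    return sum(math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups)

def l1_linf_penalty(x, groups):
    """Sum over groups of the largest absolute coefficient in the group."""
    return sum(max(abs(x[i]) for i in g) for g in groups)

groups = [[0, 1, 2], [2, 3]]
x_sparse_within = [3.0, 0.0, 4.0, 0.0]   # one dominant coefficient per group
x_equalized = [3.0, 3.0, 4.0, 4.0]       # smaller coefficients raised toward the maximum
```

Under the $l_1/l_\infty$ penalty both vectors cost the same (8.0), because only the largest entry of each group is charged, so raising the other entries up to that level is free. Under the $l_1/l_2$ penalty the second vector is strictly more expensive than the first (which costs 9.0).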

5.4.2 Video Sequence Background Subtraction 

We next considered the video sequence background subtraction task from [30, 23]. The main objective here is to segment out foreground objects in an image (frame), given a sequence of $m$ frames from a fixed camera. The data used in this experiment is available online [46]. The basic setup of the problem is as follows. We represent each frame of $n$ pixels as a column vector $A_j \in \mathbb{R}^n$ and form the matrix $A \in \mathbb{R}^{n \times m}$ as $A = (A_1\ A_2\ \cdots\ A_m)$. The test frame is represented by $b \in \mathbb{R}^n$. We model the relationship between $b$ and $A$ by $b \approx Ax + e$, where $x$ is assumed to be sparse, and $e$ is the 'noise' term, which is also assumed to be sparse. $Ax$ is thus a sparse linear combination of the video frame sequence and accounts for the background present in both $A$ and $b$. $e$ contains the sparse foreground objects in $b$. The basic model with $l_1$-regularization (Lasso) is

$$\min_{x,e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda(\|x\|_1 + \|e\|_1). \qquad (43)$$

It has been shown in [30] that we can significantly improve the quality of segmentation by applying a group-structured regularization on $e$, where the groups are all the overlapping $k \times k$-square patches in the image. Here, we set $k = 3$. The model thus becomes

$$\min_{x,e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda(\|x\|_1 + \|e\|_1 + \Omega(e)). \qquad (44)$$

Note that (44) still fits into the group-sparse framework if we treat the $l_1$-regularization terms as the sum of the group norms, where each group consists of only one element.

Figure 3: Scalability test results on the DCT set with $l_1/l_2$-regularization (left column) and $l_1/l_\infty$-regularization (right column). The scale of the y-axis is logarithmic. FISTA-p and ADAL are in PCG mode. The dotted lines denote the results obtained with the dynamic updating scheme for $\mu$.

We also considered an alternative model, where a Ridge regularization is applied to $x$ and an Elastic-Net penalty [50] to $e$. This model

$$\min_{x,e}\ \frac{1}{2}\|Ax + e - b\|^2 + \lambda_1\|e\|_1 + \lambda_2(\|x\|^2 + \|e\|^2) \qquad (45)$$

does not yield a sparse $x$, but sparsity in $x$ is not a crucial factor here. It is, however, well suited for our partial linearization methods (APLM-S and FISTA-p), since there is no need for the augmented Lagrangian framework. Of course, we can also apply FISTA to solve (45).

Footnote: The video data is available at http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm

Figure 4: On the left: Plot of root-mean-squared-error against the number of active genes for the Breast Cancer data. The plot is based on the regularization path for ten different values of $\lambda$. The total CPU time (in Matlab) using FISTA-p was 51 seconds for $l_1/l_2$-regularization and 115 seconds for $l_1/l_\infty$-regularization. On the right: The recovered sparse gene coefficients for predicting the length of the survival period. The value of $\lambda$ used here was the one minimizing the RMSE in the plot on the left.

Figure 5: Pathway-level sparsity vs. gene-level sparsity.

We recovered the foreground objects by solving the above optimization problems and applying the sparsity pattern of $e$ as a mask for the original test frame. A hand-segmented evaluation image from [56] served as the ground truth. The regularization parameters $\lambda$, $\lambda_1$, and $\lambda_2$ were selected in such a way that the recovered foreground objects matched the ground truth to the maximum extent.

FISTA-p was used to solve all three models. The $l_1$ model (43) was treated as a special case of the group regularization model (44), with each group containing only one component of the feature vector. For the Ridge/Elastic-Net penalty model, we applied FISTA-p directly without the outer augmented Lagrangian layer.

The solutions for the $l_1/l_2$, $l_1/l_\infty$, and Lasso models were not strictly sparse in the sense that those supposedly zero feature coefficients had non-zero (albeit extremely small) magnitudes, since we enforced the linear constraints $Cx = y$ through an augmented Lagrangian approach. To obtain sparse solutions, we truncated the non-sparse solutions using thresholds ranging from 10-^ to 10~^ and selected the threshold that yielded the best accuracy.

Figure 6: Objective values vs. outer iterations and objective values vs. CPU time plots for the Breast Cancer data. The results for ProxGrad are not plotted due to the different objective function that it minimizes. The red (APLM-S) and blue (FISTA-p) lines overlap in the left column.

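Note that because of the additional feature vector $e$, the data matrix is effectively $\tilde{A} = (A\ \ I_n) \in \mathbb{R}^{n \times (m+n)}$. Hence, in solving (44), FISTA-p has to solve the linear system

$$\begin{pmatrix} A^TA + \frac{1}{\mu}D_x & A^T \\ A & I_n + \frac{1}{\mu}D_e \end{pmatrix} \begin{pmatrix} x \\ e \end{pmatrix} = \begin{pmatrix} r_x \\ r_e \end{pmatrix}, \qquad (46)$$

where $D$ is a diagonal matrix, and $D_x$, $D_e$, $r_x$, $r_e$ are the components of $D$ and $r$ corresponding to $x$ and $e$ respectively. In this example, $n$ is much larger than $m$, e.g. $n = 57600$, $m = 200$. To avoid solving a system of size $n \times n$, we took the Schur complement of $I_n + \frac{1}{\mu}D_e$ and solved instead the positive definite $m \times m$ system

$$\left(A^TA + \frac{1}{\mu}D_x - A^T\left(I + \frac{1}{\mu}D_e\right)^{-1}A\right)x = r_x - A^T\left(I + \frac{1}{\mu}D_e\right)^{-1}r_e, \qquad (47)$$

$$e = \mathrm{diag}\left(1 + \tfrac{1}{\mu}D_e\right)^{-1}(r_e - Ax). \qquad (48)$$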
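The Schur-complement reduction can be checked numerically on a small instance. The sketch below assumes the block system has the form written in (46), with the diagonal blocks scaled by $1/\mu$; the function name and argument layout are ours, and `dx`, `de` hold the diagonals of $D_x$ and $D_e$.

```python
import numpy as np

def solve_via_schur(A, dx, de, mu, rx, re):
    """Solve the (m+n) x (m+n) block system through the m x m Schur complement.

    dx, de: diagonals of D_x and D_e; rx, re: the two blocks of the right-hand side.
    """
    w = 1.0 / (1.0 + de / mu)                       # diagonal of (I + D_e/mu)^{-1}
    # Schur complement: A^T A + D_x/mu - A^T (I + D_e/mu)^{-1} A
    S = A.T @ A + np.diag(dx / mu) - A.T @ (w[:, None] * A)
    x = np.linalg.solve(S, rx - A.T @ (w * re))     # reduced system (47)
    e = w * (re - A @ x)                            # back-substitution (48)
    return x, e
```

Because $(I + \frac{1}{\mu}D_e)$ is diagonal, its inverse costs only $O(n)$, and the dense solve involves an $m \times m$ matrix only, which is the point of the reduction when $n \gg m$.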

The $l_1/l_\infty$ model yielded the best background separation accuracy (marginally better than the $l_1/l_2$ model), but it also was the most computationally expensive. (See Table 3 and Figure 7.) Although the Ridge/Elastic-Net model yielded as poor separation results as the Lasso ($l_1$) model, it was orders of magnitude faster to solve using FISTA-p. We again observed that the dynamic scheme for $\mu$ worked better for FISTA-p than for ADAL. For a constant $\mu$ over the entire run, ADAL took at least twice as long as FISTA-p to produce a solution of the same quality. A typical run of FISTA-p on this problem with the best selected $\lambda$ took fewer than 10 outer iterations. On the other hand, ADAL took more than 500 iterations to meet the stopping criteria.

Footnote: We did not use the original version of FISTA to solve the model as an $l_1$-regularization problem because it took too long to converge in our experiments due to extremely small step sizes.

Figure 7: Separation results for the video sequence background subtraction example. Each training image had 120 x 160 RGB pixels. The training set contained 200 images in sequence. The accuracy indicated for each of the different models is the percentage of pixels that matched the ground truth. (Panel labels recovered from the figure: Ridge + ElasticNet; accuracy = 98.18%, accuracy = 87.83%, accuracy = 87.89%.)

Model                 Accuracy (percent)   Total CPU time (s)   No. parameter values on reg path
l1/l2                 97.17                2.48e+003            8
l1/l∞                 98.18                4.07e+003            6
l1                    87.63                1.61e+003            11
ridge + elastic net   87.89                1.82e+002            64

Table 3: Computational results for the video sequence background subtraction example. The algorithm used is FISTA-p. We used the Matlab version for the ease of generating the images. The C++ version runs at least four times faster from our experience in the previous experiments. We report the best accuracy found on the regularization path of each model. The total CPU time is recorded for computing the entire regularization path, with the specified number of different regularization parameter values.

5.5 Comments on Results 

The computational results exhibit two general patterns. First, the simpler algorithms (FISTA-p and ADAL) were significantly faster than the more general algorithms, such as APLM-S. Interestingly, the majority of the APLM-S inner iterations consisted of a skipping step for the tests on synthetic data and the breast cancer data, which means that APLM-S essentially behaved like ISTA-p in these cases. Indeed, FISTA-p generally required the same number of outer iterations as APLM-S but far fewer inner iterations, as predicted by theory. In addition, no computational steps were wasted and no function evaluations were required for FISTA-p and ADAL. Second, FISTA-p converged faster (required fewer iterations) than its full-linearization counterpart FISTA. We have suggested possible reasons for this in Section 3. On the other hand, FISTA was very effective for data both of whose dimensions were large because it required only gradient computations and soft-thresholding operations, and did not require linear systems to be solved.

Our experiments showed that the performance of ADAL (as well as the quality of the solution that it returned) varied a lot as a function of the parameter settings, and it was tricky to tune them optimally. In contrast, FISTA-p exhibited fairly stable performance for a simple set of parameters that we rarely had to alter and in general performed better than ADAL.

It may seem straightforward to apply FISTA directly to the Lasso problem (43) without the augmented Lagrangian framework. However, as we have seen in our experiments, FISTA took much longer than AugLag-FISTA-p to solve this problem. We believe that this is further evidence of the 'load-balancing' property of the latter algorithm that we discussed in Section 3.2. It also demonstrates the versatility of our approach to regularized learning problems.



6 Conclusion 

We have built a unified framework for solving sparse learning problems involving group-structured regularization, in particular, the $l_1/l_2$- or $l_1/l_\infty$-regularization of arbitrarily overlapping groups of variables. For the key building-block of this framework, we developed new efficient algorithms based on alternating partial-linearization/splitting, with proven convergence rates. In addition, we have also incorporated ADAL and FISTA into our framework. Computational tests on several sets of synthetic test data demonstrated the relative strength of the algorithms, and through two real-world applications we compared the relative merits of these structured sparsity-inducing norms. Among the algorithms studied, FISTA-p and ADAL performed the best on most of the data sets, and FISTA appeared to be a good alternative choice for large-scale data. From our experience, FISTA-p is easier to configure and is more robust to variations in the algorithm parameters. Together, they form a flexible and versatile suite of methods for group-sparse problems of different sizes.
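Footnote: To avoid confusion with our algorithms that consist of inner-outer iterations, we prefix our algorithms with 'AugLag' here.

A Proof of Lemma 3.1

$$F(\bar{x},\bar{y}) - F(x,q) \ge F(\bar{x},\bar{y}) - \mathcal{L}_\rho(x,y,q,\nabla_y f(x,y))$$
$$= F(\bar{x},\bar{y}) - \left( f(x,y) + \nabla_y f(x,y)^T(q-y) + \frac{1}{2\rho}\|q-y\|^2 + g(q) \right). \qquad (49)$$

From the optimality of $q$, we also have

$$\gamma_g(q) + \nabla_y f(x,y) + \frac{1}{\rho}(q-y) = 0, \qquad (50)$$

where $\gamma_g(q)$ is a subgradient of $g$ at $q$. Since $F(\bar{x},\bar{y}) = f(\bar{x},\bar{y}) + g(\bar{y})$, and $f$ and $g$ are convex functions, for any $(\bar{x},\bar{y})$,

$$F(\bar{x},\bar{y}) \ge g(q) + (\bar{y}-q)^T\gamma_g(q) + f(x,y) + (\bar{y}-y)^T\nabla_y f(x,y) + (\bar{x}-x)^T\nabla_x f(x,y). \qquad (51)$$

Therefore, from (49), (50), and (51), it follows that

$$F(\bar{x},\bar{y}) - F(x,q) \ge g(q) + (\bar{y}-q)^T\gamma_g(q) + f(x,y) + (\bar{y}-y)^T\nabla_y f(x,y) + (\bar{x}-x)^T\nabla_x f(x,y)$$
$$\quad - \left( g(q) + f(x,y) + \nabla_y f(x,y)^T(q-y) + \frac{1}{2\rho}\|q-y\|^2 \right)$$
$$= (\bar{y}-q)^T\left(\gamma_g(q) + \nabla_y f(x,y)\right) - \frac{1}{2\rho}\|q-y\|^2 + (\bar{x}-x)^T\nabla_x f(x,y)$$
$$= (\bar{y}-q)^T\left(-\frac{1}{\rho}(q-y)\right) - \frac{1}{2\rho}\|q-y\|^2 + (\bar{x}-x)^T\nabla_x f(x,y)$$
$$= \frac{1}{2\rho}\left(\|q-\bar{y}\|^2 - \|y-\bar{y}\|^2\right) + (\bar{x}-x)^T\nabla_x f(x,y). \qquad (52)$$

The last step uses the identity $a^Tb - \frac{1}{2}\|b\|^2 = \frac{1}{2}(\|a\|^2 - \|a-b\|^2)$ with $a = q - \bar{y}$ and $b = q - y$.

The proof for the second part of the lemma is very similar, but we give it for completeness.

$$F(\bar{x},\bar{y}) - F((p,q)) \ge F(\bar{x},\bar{y}) - \left( f((p,q)) + g(y) + \gamma_g(y)^T((p,q)_y - y) + \frac{1}{2\rho}\|(p,q)_y - y\|^2 \right). \qquad (53)$$

By the optimality of $(p,q)$, we have

$$\nabla_x f((p,q)) = 0, \qquad (54)$$
$$\nabla_y f((p,q)) + \gamma_g(y) + \frac{1}{\rho}((p,q)_y - y) = 0. \qquad (55)$$

Since $F(\bar{x},\bar{y}) = f(\bar{x},\bar{y}) + g(\bar{y})$, it follows from the convexity of both $f$ and $g$ and (54) that

$$F(\bar{x},\bar{y}) \ge g(y) + (\bar{y}-y)^T\gamma_g(y) + f((p,q)) + (\bar{y} - (p,q)_y)^T\nabla_y f((p,q)). \qquad (56)$$

Now combining (53), (55), and (56), it follows that

$$F(\bar{x},\bar{y}) - F((p,q)) \ge (\bar{y} - (p,q)_y)^T\left(\gamma_g(y) + \nabla_y f((p,q))\right) - \frac{1}{2\rho}\|(p,q)_y - y\|^2$$
$$= (\bar{y} - (p,q)_y)^T\left(\frac{1}{\rho}(y - (p,q)_y)\right) - \frac{1}{2\rho}\|(p,q)_y - y\|^2$$
$$= \frac{1}{2\rho}\left(\|(p,q)_y - \bar{y}\|^2 - \|y - \bar{y}\|^2\right). \qquad (57)$$

B Proof of Theorem 3.1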



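Let $I$ be the set of all regular iteration indices among the first $k-1$ iterations, and let $I_c$ be its complement. For all $n \in I_c$, $\bar{y}^{n+1} = y^n$.

For $n \in I$, we can apply Lemma 3.1 since (21) automatically holds, and (19) holds when $\rho \le 1/L(f)$. In (22), by letting $(\bar{x},\bar{y}) = (x^*,y^*)$ and $y = y^n$, we get $(p,q) = (x^{n+1},\bar{y}^{n+1})$, and

$$2\rho\left(F(x^*,y^*) - F(x^{n+1},\bar{y}^{n+1})\right) \ge \|\bar{y}^{n+1} - y^*\|^2 - \|y^n - y^*\|^2. \qquad (58)$$

In (20), by letting $(\bar{x},\bar{y}) = (x^*,y^*)$ and $(x,y) = (x^{n+1},\bar{y}^{n+1})$, we get $q = y^{n+1}$ and

$$2\rho\left(F(x^*,y^*) - F(x^{n+1},y^{n+1})\right) \ge \|y^{n+1} - y^*\|^2 - \|\bar{y}^{n+1} - y^*\|^2 + 2\rho\,(x^* - x^{n+1})^T\nabla_x f(x^{n+1},\bar{y}^{n+1})$$
$$= \|y^{n+1} - y^*\|^2 - \|\bar{y}^{n+1} - y^*\|^2, \qquad (59)$$

since $\nabla_x f(x^{n+1},\bar{y}^{n+1}) = 0$, for $n \in I$ by (54) and for $n \in I_c$ by (18). Adding (59) to (58), we get

$$2\rho\left(2F(x^*,y^*) - F(x^{n+1},\bar{y}^{n+1}) - F(x^{n+1},y^{n+1})\right) \ge \|y^{n+1} - y^*\|^2 - \|y^n - y^*\|^2. \qquad (60)$$

For $n \in I_c$, since $\nabla_x f(x^{n+1},\bar{y}^{n+1}) = 0$, we have that (59) holds. Since $\bar{y}^{n+1} = y^n$, it follows that

$$2\rho\left(F(x^*,y^*) - F(x^{n+1},y^{n+1})\right) \ge \|y^{n+1} - y^*\|^2 - \|y^n - y^*\|^2. \qquad (61)$$

Summing (60) and (61) over $n = 0, 1, \ldots, k-1$ and observing that $2|I| + |I_c| = k + k_n$, we obtain

$$2\rho\left((k + k_n)F(x^*,y^*) - \sum_{n=0}^{k-1} F(x^{n+1},y^{n+1}) - \sum_{n \in I} F(x^{n+1},\bar{y}^{n+1})\right) \qquad (62)$$
$$\ge \sum_{n=0}^{k-1}\left(\|y^{n+1} - y^*\|^2 - \|y^n - y^*\|^2\right) = \|y^k - y^*\|^2 - \|y^0 - y^*\|^2$$
$$\ge -\|y^0 - y^*\|^2.$$

In Lemma 3.1, by letting $(\bar{x},\bar{y}) = (x^{n+1},\bar{y}^{n+1})$ in (20) instead of $(x^*,y^*)$, we have from (59) that

$$2\rho\left(F(x^{n+1},\bar{y}^{n+1}) - F(x^{n+1},y^{n+1})\right) \ge \|y^{n+1} - \bar{y}^{n+1}\|^2 \ge 0. \qquad (63)$$

Similarly, for $n \in I$, if we let $(\bar{x},\bar{y}) = (x^n,y^n)$ instead of $(x^*,y^*)$ in (58), we have

$$2\rho\left(F(x^n,y^n) - F(x^{n+1},\bar{y}^{n+1})\right) \ge \|\bar{y}^{n+1} - y^n\|^2 \ge 0. \qquad (64)$$

For $n \in I_c$, $\bar{y}^{n+1} = y^n$; from (18), since $x^{n+1} = \arg\min_x F(x,y)$ with $y = y^n = \bar{y}^{n+1}$,

$$2\rho\left(F(x^n,y^n) - F(x^{n+1},\bar{y}^{n+1})\right) \ge 0. \qquad (65)$$

Hence, from (63) and (64) to (65), $F(x^n,\bar{y}^n) \ge F(x^n,y^n) \ge F(x^{n+1},\bar{y}^{n+1}) \ge F(x^{n+1},y^{n+1})$. Then, we have

$$\sum_{n=0}^{k-1} F(x^{n+1},y^{n+1}) \ge kF(x^k,y^k), \quad \text{and} \quad \sum_{n \in I} F(x^{n+1},\bar{y}^{n+1}) \ge k_n F(x^k,y^k). \qquad (66)$$

Combining (62) and (66) yields $2\rho\,(k + k_n)\left(F(x^*,y^*) - F(x^k,y^k)\right) \ge -\|y^0 - y^*\|^2$.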



C Derivation of the Stopping Criteria 

In this section, we show that the quantities that we use in our stopping criteria correspond to 
the primal and dual residuals [8] for the outer iterations and the gradient residuals for the inner 
iterations. We first consider the inner iterations. 
FISTA-p The necessary and sufficient optimality conditions for problem (12) or (15) are primal feasibility

$$\bar{y}^* - y^* = 0, \qquad (67)$$

and vanishing of the gradient of the objective function at the solution, i.e.

$$0 = \nabla_x f(x^*,\bar{y}^*), \qquad (68)$$
$$0 \in \nabla_y f(x^*,y^*) + \partial g(y^*). \qquad (69)$$

Since $\bar{y}^{k+1} = z^k$, the primal residual is thus $y^{k+1} - \bar{y}^{k+1} = y^{k+1} - z^k$. It follows from the optimality of $x^{k+1}$ in Line 3 of Algorithm 3.4 that

$$\nabla_x f(x^{k+1},y^{k+1}) = \frac{1}{\mu}C^T(z^k - y^{k+1}).$$

Similarly, from the optimality of $y^{k+1}$ in Line 4, we have that

$$0 \in \partial g(y^{k+1}) + \nabla_y f(x^{k+1},z^k) + \frac{1}{\rho}(y^{k+1} - z^k)$$
$$= \partial g(y^{k+1}) + \nabla_y f(x^{k+1},y^{k+1}) - \frac{1}{\mu}(y^{k+1} - z^k) + \frac{1}{\rho}(y^{k+1} - z^k)$$
$$= \partial g(y^{k+1}) + \nabla_y f(x^{k+1},y^{k+1}),$$

where the last step follows from $\mu = \rho$. Hence, we see that $\frac{1}{\mu}C^T(z^k - y^{k+1})$ is the gradient residual corresponding to (68), while (69) is satisfied in every inner iteration.

APLM-S The primal residual is $y^{k+1} - \bar{y}^{k+1}$ from (67). Following the derivation for FISTA-p, it is not hard to verify that (69) is always satisfied, and that the gradient residual corresponding to (68) is $\frac{1}{\mu}C^T(\bar{y}^{k+1} - y^{k+1})$.

FISTA Similar to FISTA-p, the necessary and sufficient optimality conditions for problem (28) are primal feasibility

$$(x^*,y^*) = (\bar{x}^*,\bar{y}^*),$$

and vanishing of the objective gradient at $(x^*,y^*)$,

$$0 = \nabla_x f(\bar{x}^*,\bar{y}^*),$$
$$0 \in \nabla_y f(x^*,y^*) + \partial g(y^*).$$

Clearly, the primal residual is $(x^{k+1} - z_x^k,\ y^{k+1} - z_y^k)$ since $(\bar{x}^{k+1},\bar{y}^{k+1}) = (z_x^k, z_y^k)$. From the optimality of $(x^{k+1},y^{k+1})$, it follows that

$$0 = \nabla_x f(z_x^k,z_y^k) + \frac{1}{\rho}(x^{k+1} - z_x^k),$$
$$0 \in \partial g(y^{k+1}) + \nabla_y f(z_x^k,z_y^k) + \frac{1}{\rho}(y^{k+1} - z_y^k).$$

Here, we simply use $\frac{1}{\rho}(x^{k+1} - z_x^k)$ and $\frac{1}{\rho}(y^{k+1} - z_y^k)$ to approximate the gradient residuals.
Data Sets            Algs       CPU (s)      Iters   Avg Sub-iters   F(x)
ogl-5000-100-10-3    ADAL       1.70e+000    61      1.00e+000       1.9482e+005
                     APLM-S     1.71e+000    8       4.88e+000       1.9482e+005
                     FISTA-p    9.08e-001    8       4.38e+000       1.9482e+005
                     FISTA      2.74e+000    10      7.30e+000       1.9482e+005
                     ProxGrad   7.92e+001    3858    -               -
ogl-5000-600-10-3    ADAL       6.75e+001    105     1.00e+000       1.4603e+006
                     APLM-S     1.79e+002    9       1.74e+001       1.4603e+006
                     FISTA-p    4.77e+001    9       8.56e+000       1.4603e+006
                     FISTA      3.28e+001    12      1.36e+001       1.4603e+006
                     ProxGrad   7.96e+002    5608    -               -
ogl-5000-1000-10-3   ADAL       2.83e+002    151     1.00e+000       2.6746e+006
                     APLM-S     8.06e+002    10      2.76e+001       2.6746e+006
                     FISTA-p    2.49e+002    10      1.28e+001       2.6746e+006
                     FISTA      5.21e+001    13      1.55e+001       2.6746e+006
                     ProxGrad   1.64e+003    6471    -               -

Table 4: Numerical results for ogl set 1. For ProxGrad, the Avg Sub-iters and F(x) fields are not applicable since the algorithm is not based on an outer-inner iteration scheme, and the objective function that it minimizes is different from ours. We tested ten problems with $J = 100, \ldots, 1000$, but only show the results for three of them to save space.



Next, we consider the outer iterations. The necessary and sufficient optimality conditions for problem (8) are primal feasibility

$$Cx^* - y^* = 0, \qquad (70)$$

and dual feasibility

$$0 = \nabla L(x^*) - C^Tv^*, \qquad (71)$$
$$0 \in \partial\Omega(y^*) + v^*. \qquad (72)$$

Clearly, the primal residual is $r^{l+1} = Cx^{l+1} - y^{l+1}$. The dual residual is

$$\begin{pmatrix} \nabla L(x^{l+1}) - C^T\left(v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})\right) \\ \partial\Omega(y^{l+1}) + v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1}) \end{pmatrix},$$

recalling that $v^{l+1} = v^l - \frac{1}{\mu}(Cx^{l+1} - y^{l+1})$. The above is simply the gradient of the augmented Lagrangian $\mathcal{L}$ evaluated at $(x^{l+1},y^{l+1},v^l)$. Now, since the objective function of an inner iteration is the augmented Lagrangian with $v = v^l$, the dual residual for an outer iteration is readily available from the gradient residual computed for the last inner iteration of the outer iteration.
Data Sets            Algs       CPU (s)      Iters   Avg Sub-iters   F(x)
ogl-1000-200-10-3    ADAL       4.18e+000    77      1.00e+000       9.6155e+004
                     APLM-S     1.64e+001    9       2.32e+001       9.6156e+004
                     FISTA-p    3.85e+000    9       1.02e+001       9.6156e+004
                     FISTA      2.92e+000    11      1.44e+001       9.6158e+004
                     ProxGrad   1.16e+002    4137    -               -
ogl-5000-200-10-3    ADAL       5.04e+000    63      1.00e+000       4.1573e+005
                     APLM-S     8.42e+000    8       8.38e+000       4.1576e+005
                     FISTA-p    3.96e+000    9       6.56e+000       4.1572e+005
                     FISTA      6.54e+000    10      9.70e+000       4.1573e+005
                     ProxGrad   1.68e+002    4345    -               -
ogl-10000-200-10-3   ADAL       6.41e+000    44      1.00e+000       1.0026e+006
                     APLM-S     1.46e+001    10      7.60e+000       1.0026e+006
                     FISTA-p    5.60e+000    10      5.50e+000       1.0026e+006
                     FISTA      1.09e+001    10      8.50e+000       1.0027e+006
                     ProxGrad   3.31e+002    6186    -               -

Table 5: Numerical results for ogl set 2. We ran the test for ten problems with $n = 1000, \ldots, 10000$, but only show the results for three of them to save space.
Data Sets              Algs      CPU (s)      Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    ADAL      1.14e+001    194     1.00e+000       8.4892e+002
                       FISTA-p   1.21e+001    20      1.11e+001       8.4892e+002
                       FISTA     2.49e+001    24      2.51e+001       8.4893e+002
ogl-dct-1000-10000-1   ADAL      3.31e+001    398     1.00e+000       1.4887e+003
                       FISTA-p   2.54e+001    41      5.61e+000       1.4887e+003
                       FISTA     6.33e+001    44      1.74e+001       1.4887e+003
ogl-dct-1000-15000-1   ADAL      6.09e+001    515     1.00e+000       2.7506e+003
                       FISTA-p   3.95e+001    52      4.44e+000       2.7506e+003
                       FISTA     9.73e+001    54      1.32e+001       2.7506e+003
ogl-dct-1000-20000-1   ADAL      9.52e+001    626     1.00e+000       3.3415e+003
                       FISTA-p   6.66e+001    63      6.10e+000       3.3415e+003
                       FISTA     1.81e+002    64      1.61e+001       3.3415e+003
ogl-dct-1000-25000-1   ADAL      1.54e+002    882     1.00e+000       4.1987e+003
                       FISTA-p   7.50e+001    88      3.20e+000       4.1987e+003
                       FISTA     1.76e+002    89      8.64e+000       4.1987e+003
ogl-dct-1000-30000-1   ADAL      1.87e+002    957     1.00e+000       4.6111e+003
                       FISTA-p   8.79e+001    96      2.86e+000       4.6111e+003
                       FISTA     2.24e+002    94      8.54e+000       4.6111e+003

Table 6: Numerical results for dct set 2 (scalability test) with $l_1/l_2$-regularization. All three algorithms were run in factorization mode with a fixed $\mu = \mu_0$.
Data Set               Alg        CPU (s)     Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    ADAL       1.53e+001   266     1.00e+000       7.3218e+002
                       FISTA-p    1.61e+001   10      3.05e+001       7.3219e+002
                       FISTA      3.02e+001   16      4.09e+001       7.3233e+002
                       ProxFlow   1.97e+001   -       -               7.3236e+002
ogl-dct-1000-10000-1   ADAL       3.30e+001   330     1.00e+000       1.2707e+003
                       FISTA-p    3.16e+001   10      3.10e+001       1.2708e+003
                       FISTA      7.27e+001   24      3.25e+001       1.2708e+003
                       ProxFlow   3.67e+001   -       -               1.2709e+003
ogl-dct-1000-15000-1   ADAL       4.83e+001   328     1.00e+000       2.2444e+003
                       FISTA-p    5.40e+001   15      2.52e+001       2.2444e+003
                       FISTA      8.64e+001   23      2.66e+001       2.2449e+003
                       ProxFlow   9.91e+001   -       -               2.2467e+003
ogl-dct-1000-20000-1   ADAL       8.09e+001   463     1.00e+000       2.6340e+003
                       FISTA-p    8.09e+001   16      2.88e+001       2.6340e+003
                       FISTA      1.48e+002   26      2.93e+001       2.6342e+003
                       ProxFlow   2.55e+002   -       -               2.6357e+003
ogl-dct-1000-25000-1   ADAL       7.48e+001   309     1.00e+000       3.5566e+003
                       FISTA-p    1.15e+002   30      1.83e+001       3.5566e+003
                       FISTA      2.09e+002   38      2.30e+001       3.5568e+003
                       ProxFlow   1.38e+002   -       -               3.5571e+003
ogl-dct-1000-30000-1   ADAL       9.99e+001   359     1.00e+000       3.7057e+003
                       FISTA-p    1.55e+002   29      2.17e+001       3.7057e+003
                       FISTA      2.60e+002   39      2.25e+001       3.7060e+003
                       ProxFlow   1.07e+002   -       -               3.7063e+003

Table 7: Numerical results for dct set 2 (scalability test) with $\ell_1/\ell_\infty$-regularization. The algorithm
configurations are exactly the same as in Table 6.



[19] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing 
the sum of two convex functions. Arxiv preprint arXiv:0912.4571v2, 2009. 

[20] T. Goldstein and S. Osher. The split bregman method for U-regularized problems. SIAM 
Journal on Imaging Sciences, 2:323, 2009. 

[21] G. Golub and C. Van Loan. Matrix computations. Johns Hopkins Univ Pr, 1996. 

[22] M. Hestenes. Multiplier and gradient methods. Journal of optimization theory and applications, 
4(5):303-320, 1969. 

[23] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of 
the 26th Annual International Conference on Machine Learning, pages 417-424. ACM, 2009. 

[24] L. Jacob, G. Obozinski, and J. Vert. Group Lasso with overlap and graph Lasso. In Proceedings 
of the 26th Annual International Conference on Machine Learning, pages 433-440. ACM, 2009. 

[25] R. Jenatton, J. Audibert, and F. Bach. Structured variable selection with sparsity-inducing 
norms. Stat, 1050, 2009. 






Data Set               Alg       CPU (s)     Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    FISTA-p   1.83e+001   12      2.34e+001       8.4892e+002
                       FISTA     2.49e+001   24      2.51e+001       8.4893e+002
                       ADAL      1.35e+001   181     1.00e+000       8.4892e+002
ogl-dct-1000-10000-1   FISTA-p   3.16e+001   14      1.73e+001       1.4887e+003
                       FISTA     6.33e+001   44      1.74e+001       1.4887e+003
                       ADAL      4.43e+001   270     1.00e+000       1.4887e+003
ogl-dct-1000-15000-1   FISTA-p   4.29e+001   14      1.51e+001       2.7506e+003
                       FISTA     9.73e+001   54      1.32e+001       2.7506e+003
                       ADAL      5.37e+001   216     1.00e+000       2.7506e+003
ogl-dct-1000-20000-1   FISTA-p   7.53e+001   13      2.06e+001       3.3416e+003
                       FISTA     1.81e+002   64      1.61e+001       3.3415e+003
                       ADAL      1.57e+002   390     1.00e+000       3.3415e+003
ogl-dct-1000-25000-1   FISTA-p   7.41e+001   15      1.47e+001       4.1987e+003
                       FISTA     1.76e+002   89      8.64e+000       4.1987e+003
                       ADAL      8.79e+001   231     1.00e+000       4.1987e+003
ogl-dct-1000-30000-1   FISTA-p   8.95e+001   14      1.58e+001       4.6111e+003
                       FISTA     2.24e+002   94      8.54e+000       4.6111e+003
                       ADAL      1.12e+002   249     1.00e+000       4.6111e+003

Table 8: Numerical results for the DCT set with $\ell_1/\ell_2$-regularization. FISTA-p and ADAL were
run in PCG mode with the dynamic scheme for updating $\mu$. $\mu$ was fixed at $\mu_0$ for FISTA.
Data Set               Alg        CPU (s)     Iters   Avg Sub-iters   F(x)
ogl-dct-1000-5000-1    FISTA-p    2.30e+001   11      2.93e+001       7.3219e+002
                       ADAL       1.89e+001   265     1.00e+000       7.3218e+002
                       FISTA      3.02e+001   16      4.09e+001       7.3233e+002
                       ProxFlow   1.97e+001   -       -               7.3236e+002
ogl-dct-1000-10000-1   FISTA-p    5.09e+001   11      3.16e+001       1.2708e+003
                       ADAL       4.77e+001   323     1.00e+000       1.2708e+003
                       FISTA      7.27e+001   24      3.25e+001       1.2708e+003
                       ProxFlow   3.67e+001   -       -               1.2709e+003
ogl-dct-1000-15000-1   FISTA-p    6.33e+001   12      2.48e+001       2.2445e+003
                       ADAL       9.41e+001   333     1.00e+000       2.2444e+003
                       FISTA      8.64e+001   23      2.66e+001       2.2449e+003
                       ProxFlow   9.91e+001   -       -               2.2467e+003
ogl-dct-1000-20000-1   FISTA-p    [...]       12      [...]           [...]
                       ADAL       [...]       [...]   [...]           [...]
                       FISTA      [...]       26      [...]           [...]
                       ProxFlow   [...]       -       -               [...]
ogl-dct-1000-25000-1   FISTA-p    [...]       13      2.98e+001       [...]
                       ADAL       1.20e+002   310     [...]           [...]
                       FISTA      2.09e+002   38      2.30e+001       3.5568e+003
                       ProxFlow   1.38e+002   -       -               3.5571e+003
ogl-dct-1000-30000-1   FISTA-p    1.75e+002   13      3.18e+001       3.7057e+003
                       ADAL       2.01e+002   361     1.00e+000       3.7057e+003
                       FISTA      2.60e+002   39      2.25e+001       3.7060e+003
                       ProxFlow   1.07e+002   -       -               3.7063e+003

Table 9: Numerical results for the DCT set with $\ell_1/\ell_\infty$-regularization. FISTA-p and ADAL were
run in PCG mode. The dynamic updating scheme for $\mu$ was applied to FISTA-p, while $\mu$ was fixed
at $\mu_0$ for ADAL and FISTA.
Data Set           Alg        CPU (s)     Iters   Avg Sub-iters   F(x)
BreastCancerData   ADAL       6.24e+000   136     1.00e+000       2.9331e+003
                   APLM-S     4.02e+001   12      4.55e+001       2.9331e+003
                   FISTA-p    6.86e+000   12      1.48e+001       2.9331e+003
                   FISTA      5.11e+001   75      1.29e+001       2.9340e+003
                   ProxGrad   7.76e+002   6605    1.00e+000       -

Table 10: Numerical results for Breast Cancer Data using $\ell_1/\ell_2$-regularization. In this experiment,
we kept $\mu$ constant at 0.01 for ADAL. The CPU time is for a single run on the entire data set with
the value of $\lambda$ selected to minimize the RMSE in Figure 4.